Best Local LLMs - Apr 2026
Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 362 comments
We're back with another Best Local LLMs Megathread!
We have continued feasting in the months since the previous thread with the much anticipated release of the Qwen3.5 and Gemma4 series. If that wasn't enough, we are having some scarcely believable moments with GLM-5.1 boasting SOTA level performance, Minimax-M2.7 being the accessible Sonnet at home, PrismML Bonsai 1-bit models that actually work etc. Tell us what your favorites are right now!
The standard spiel:
Share what you are running right now and why. Given the nature of the beast in evaluating LLMs (untrustworthiness of benchmarks, immature tooling, intrinsic stochasticity), please be as detailed as possible in describing your setup, nature of your usage (how much, personal/professional use), tools/frameworks/prompts etc.
Rules
- Only open weights models
Please thread your responses in the top level comments for each Application below to enable readability
Applications
- General: Includes practical guidance, how to, encyclopedic QnA, search engine replacement/augmentation
- Agentic/Agentic Coding/Tool Use/Coding
- Creative Writing/RP
- Speciality
If a category is missing, please create a top level comment under the Speciality comment
Notes
Useful breakdown of how folk are using LLMs: [)
Bonus points if you breakdown/classify your recommendation by model memory footprint: (you can and should be using multiple models in each size range for different tasks)
- Unlimited: >128GB VRAM
- XL: 64 to 128GB VRAM
- L: 32 to 64GB VRAM
- M: 8 to 32GB VRAM
- S: <8GB VRAM
kelembu@reddit
Uncensored answers, Ryzen 5800x, 32GB RAM, 16GB VRAM AMD 6800 Gpu
rm-rf-rm@reddit (OP)
Agentic/Agentic Coding/Tool Use/Coding
awitod@reddit
I am using unsloth/Qwen3.5-35B-A3B-Q5_K_XL and getting excellent results. I am using it over 27b for memory management and speed because I am testing a config that works without any cloud services and seeing how much quality I can get if I load everything at once.
I have ASR, TTS, text2Image, image2image, LLM with vision and embeddings simultaneously.
System: 96 GB RAM, 56 GB VRAM total (RTX 5090 + RTX 4090)
unsloth/Qwen3.5-35B-A3B-Q5_K_XL
mmproj-F16.gguf
llama.cpp config:
ctx-size=262144
threads=16
parallel=5
cache-ram=8192
n-gpu-layers=999
kv-unified=1
jinja=1
cont-batching=1
Using unsloth guide rec's for inference settings
temperature=0.7
top_p=0.8
top_k=20
min_p=0.0
presence_penalty=1.5
repetition_penalty=1.0
thinking toggle via chat_template_kwargs.enable_thinking (off in most but not all agents)
parallel_tool_calls=true <-- VERY IMPORTANT FOR OUR USE CASES
Image stack models/config:
diffusion: flux-2-klein-4b-Q4_K_S.gguf
VAE: full_encoder_small_decoder.safetensors
text model: Qwen3-4B-Q4_K_M.gguf
defaults: steps=4, cfg_scale=1.0, strength=0.75
Other local models in same runtime:
Embeddings: microsoft/harrier-oss-v1-0.6b
ASR: Qwen/Qwen3-ASR-0.6B
TTS: microsoft/VibeVoice-1.5B + Qwen/Qwen2.5-1.5B tokenizer
Substantial-Flow9244@reddit
Do you mind if I ask what your average power draw looks like?
nacnud_uk@reddit
May I ask, is it okay to mix AMD and NVIDIA in the one system. I've a 3080 and a slightly older AMD card. The 3080 is 10gb. I think the other is 8gb.
awitod@reddit
Not at the same time for the same model runner. You could theoretically use both cards in the same machine at once but you would have to give them distinct workloads, e.g. one for LLM, one for image gen.
...but it sounds like a fun bad idea to me. :D
nacnud_uk@reddit
Ah, right. Thank you. :) I was wondering if it was seamless or not. What you say makes good sense. Thanks :)
Far-Low-4705@reddit
is this a setting in llama.cpp or something?
Caffdy@reddit
what are you using all of those simultaneously?
awitod@reddit
It goes in the chat request json. llama.cpp/docs/function-calling.md at master · ggml-org/llama.cpp
ScoreUnique@reddit
What harness do you use it on? Having a hard time having something that babysits it enough while not losing track or getting in infinite doomthinking.
Crinkez@reddit
Why 3.5 and not 3.6?
awitod@reddit
Because it wasn't out yet
rm-rf-rm@reddit (OP)
what are you using to run ASR and TTS?
awitod@reddit
rkoy1234@reddit
I am also chasing the seamless speech to speech endgame, though using omnivoice for tts and whisper+vad. how are you finding qwen asr? no need for a separate VAD?
awitod@reddit
What I am doing is pretty basic, but works great so far. One reason I picked it was because of the existence of qwen3-asr-toolkit but I have not gotten around to exploring features yet
rm-rf-rm@reddit (OP)
im asking about the engine youre using to run it... pytorch?
awitod@reddit
Sorry, I was trying to edit and get the links to the libs.
qwen-asr · PyPI
vibevoice · PyPI
Far-Low-4705@reddit
qwen3 ASR was just added to llama.cpp!
RaptorF22@reddit
How do the 2 rtx cards combine their vram? I thought that was only possible with 3090s.
awitod@reddit
It’s well supported by the nvidia drivers and mixing devices is pretty common.
Many things allow you to configure a specific GPU, all GPUs, auto or cpu only and I have been tweaking that for each thing as I go to get the most out of what I have.
It’s kind of like packing a car 😆
b0tm0de@reddit
hello. full encoder small decoder.safetensersor. i downloaded that from official repo. put it in vae folder for comfyui and its just giving error. do you know how that working for u? comfyui v18.5
awitod@reddit
I’m not using comfy, just stablediffuision.cpp which we build with the image.
puru991@reddit
What t/s are you getting?
awitod@reddit
Here is some normal output.
Total_Activity_7550@reddit
Qwen3.5-27B. Nothing else that fits into 2xRTX 3090 works for my project. I use Qwen Code.
I also have my personal written todo webapp, it has MCP server. Gemma 31B is on par with Qwen3.5-27B.
Far-Low-4705@reddit
since u have two 3090's, u should try the new `-sm tensor` flag, it enables tensor parallelism.
It is buggy, and might not be faster for qwen 3.5 yet, only for older models, but definitely keep an eye on it, it will likely get much better in the future
Warrenio@reddit
You sound knowledgeable, so I'll ask you, what is the difference between --split-mode row and --split-mode tensor?
Far-Low-4705@reddit
afaik, split mode row splits the model by rows instead of sequential layers, and it is a psuedo implementation of tensor parallelism afaik.
It's never worked for me and always output garbage, so i wouldnt use it if i were you.
-sm tensor uses real tensor parallelism, and if u have a multi gpu setup with good hardware and good inter-gpu communication speeds it will be faster than -sm layer. it basicially splits the computation across all gpu's or compute nodes.
However it is still early in development and might be buggy or not fully optimized, especially on non-nvidia hardware
crantob@reddit
Most informative post and I am first upvote. Yep reddit it is.
Total_Activity_7550@reddit
Token generation increased, but prompt processing decreased, as noted by llama.cpp developers. For my use case this isn't beneficial.
Far-Low-4705@reddit
it is still very much experimental, but you should keep an eye on it!
It will almost certianly make a difference for you since you are pretty much the target hardware
Unfortunately for me it makes no difference, but im running on two AMD MI50's so much older and not nvidia.
Ok-Internal9317@reddit
hi, is your pp slow at the start and faster at the end when things build up or its the opposite?
Total_Activity_7550@reddit
Sure it slows down with context builtup.
dinerburgeryum@reddit
Out of curiosity have you tried split mode tensor yet? I’m having a bear of a time getting it working with 3090+A4000, but 2x 3090 should work way better.
viperx7@reddit
i have a 4090+3090 and it crashes for me every-time the context exceeds 10k. feels like some kind of bug
Total_Activity_7550@reddit
Token generation increased, but prompt processing decreased, as noted by llama.cpp developers. For my use case this isn't beneficial.
CBW1255@reddit
Can you post your llama.cpp config for using that exact quant you are using? Thanks.
Total_Activity_7550@reddit
Updated parent comment.
Novel_Law4469@reddit
how do you manage to combine 2 GPUs together ? is it like a resource pool or something ?
Alternative-Cat-1347@reddit
I have modest specs but very happy this is even an option for me! the fact it runs well will never not blow my mind. I spent several days tweaking and benchmarking it to get to the best config.
I'm using it for coding with pi.dev and opencode but not on the same host as it really completely exhausts my PC running the LLM, so I code from my laptop.
I use llama.cpp and Qwen3.6-35B-A3B quants from unsloth, I run it in one of two modes:
Specs: AMD Ryzen 7 3700X, 32GB DDR4-RAM 2666 MT/s, 8GB Geforce RTX 3070 Ti, +64GB of paged pool on my fastest ssd (helps prevent total system lockup)
I ran multiple tasks that passed 200k context length and LLM was still coherent and functional! (can add features, fix bugs, etc)
mrtime777@reddit
cyankiwi/MiniMax-M2.7-AWQ-4bit (or cyankiwi/MiniMax-M2.5-AWQ-4bit) on 2xGB10 cluster..
Waarheid@reddit
Have you tried skills instead of tools? (i.e. just giving them a bash tool, and replacing all your tools with CLI commands that are each described in a markdown file)
https://mariozechner.at/posts/2025-11-02-what-if-you-dont-need-mcp/
I've found pretty small models are and to use skills effectively this way, particularly via the pi.dev agent
eikenberry@reddit
CLI tools + markdown describing their usage is pretty much an ad-hoc version of MCP.. why write custom cli commands + docs describing their use instead of writing your own MCP servers?
Waarheid@reddit
Why not do the more complicated and heavier implementation instead of simple and lightweight implementation? Great question...
eikenberry@reddit
Writing a new CLI command doesn't seem much easier than writing a new MCP server. I don't see how the CLI+docs is lighter weight than a MCP server? I guess it is the formality level... A small shell script command + quick docs would definitely be lighter weight where the MCP server requires more up front. Seems like an MCP server that simply provided a way to add shell-scripts + a quick doc to them would hit a similar level but then I guess that is pretty much what agent/harnesses are for and thus they already provide this.
TLDR; I think I've circled back around to agree with you, at least in the short run. IMO the verdict is out long term as I do think the MCP server idea has some merit.
Waarheid@reddit
MCPs became popular before harnesses that lived in the shell got big. Having a tool to run shell commands in those harnesses means just writing a shell command and an MD file is the simplest way to go, of course assuming your harness makes use of that (e.g. how pi does)
_derpiii_@reddit
Memory manager?
Zc5Gwu@reddit
+1 for minimax 2.7. I'm running with the following command on strix 128gb. Works well for agentic coding if your patient (do some laundry in the meantime). At 30k context getting about 16t/s and 50t/s prefill.
sn2006gy@reddit
works well as agentic? it's TERRIBLE lol.
BUT.. if you don't mind wasting time/electricity go for it. DON'T USE IT WITH AN API IF PEOPLE ARE READING THIS. YOUR COSTS WILL GO TO THE MOON
Local-Cartoonist3723@reddit
Didn’t get a chance to try this yet — you’re happy with it then? Any writeups?
Blues520@reddit
I'm using qwen3-coder-next on 48gb vram. Using an unsloth quant with 100k context.
Running in llamacpp with opencode.
-c 100000--flash-attn on--n-gpu-layers 999--n-cpu-moe 24--jinja--temp 1.0--top-p 0.95--min-p 0.01--top-k 40It works well enough but sometimes generates less than optimal code. I end up having to adjust the prompt and restarting the session quite often but it's fast enough that this is not a huge problem. It also has a very clinical feel so it's not amazing for UI work. I recently ran gemma4 for a UI task and it had a much better feel so probably going to alternate between qwen-coder and gemma4.
crantob@reddit
coder-next is an interesting one to pit vs 27B. What were your experiences with 3.5 27B?
Blues520@reddit
I keep going back to coder-next personally. I never found much success with 3.5 27b but I am running 3.6 27b which I alternate with. So far it's good but I alternate between coder-next and gemma as well. Hopefully the next coder will be good enough that I can retire coder-next.
Raredisarray@reddit
I’m in the same boat as you
Blues520@reddit
I use 3.6 27b now btw and only run coder-next occasionally. I found the dense model more intelligent after all.
sanjxz54@reddit
I liked how coder-next worked for me in Claude Code; apex-i-quality quant on 12gb vram (5070) with 56gb ram offloading 200k context runs somewhat fast (150t/s prefill, 20 t/s gen on llama cpp). Worked best for me to ask a free tier frontier model on how to implement something, and pass that to qwen for actual work.
But i recently tried 122b-a10b apex-i-mini and it feels better than coder next. if i had to compare, id say coder next with mcps for web search in claude code is about sonnet 3.5 from cursor v0.4 (or smth) era, while 122b one is closer to sonnet 4 level. speed is 250 t/s prefill and 10 t/s gen. Using those settings in ik llama cpp (self compiled via MSVC):
-c 230000 --fit --peg --cache-type-k q8_0 --cache-type-v q6_0 --k-cache-hadamard --v-cache-hadamard --fit-margin 2048 -np 1 -fa on -mla 3 -t 16 -tb 16 --merge-qkv -b 2048 -ub 1024 --no-mmap --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --reasoning-budget -1 --repeat-penalty 1.0 --presence-penalty 0.0 --alias "qwen-3-apex" --port 8080 (for 122b. for coder next i used temp 1.0 and top k 40, and standard llama)
Really wanna upgrade to 96+ gb ram and try m2.7 in mini quant, sounds great on paper.
Better-Monk8121@reddit
а, ну давай
Terminator857@reddit
I tested various models including gemma4 q8. Qwen 3.5 122b q4 beat them all in my tests. Wasn't even close.
Desther@reddit
What token speed?
Embarrassed_Elk_4733@reddit
try qwen3.5 27b and wait for ur feedbacks!
SpicyWangz@reddit
That would’ve too slow on a strix
Safe-Buffalo-4408@reddit
Using it on a Strix. I've tested a bunch of different models but I always go back to the qwen 3.5 27B due to quality. As faster models generate poorer output it has to spend time on fixing that, 27B won't create much errors. So yes it's slower but with Way better results. Worth it in my usecase with agentic coding and also using it as a personal assistant.
Caffdy@reddit
have you tried 3.6 27B that released today?
SpicyWangz@reddit
It performs better than the 122B for you or minimax m2.7?
Safe-Buffalo-4408@reddit
I've used 122B, but not minimax. But 27B performs better than 122B for be, it gets very obvious in Agent Zero between the two. 122B creates broken code and strange never ending loops, 27B is a work horse that slowly but firmly creates fully working software and seldom failing tool calls ans as good as never get in to loops longer than 2-3 iterations before it breaks out of them and continue it's work.
I fully recommend Qwen 3.5 27B if you are ok with slower speed but resulting in high quality code and tool calls.
rm-rf-rm@reddit (OP)
did you test it against qwen3.5 27b?
valtor2@reddit
Problem is, strix halo is very slow on dense models. 27b has been hard to use for me because of that
Terminator857@reddit
No, but I've seen others have and reported that gemma 4 wins.
rm-rf-rm@reddit (OP)
not universally
cafedude@reddit
I found that Qwen 3.5 122b would tell me stuff like tests were passing when they really weren't, or some failure was pre-existing and not a result of it's changes (when it was). I find Qwen3-coder-next to be better. Also on a strix halo system.
AlwaysLateToThaParty@reddit
I've been pretty pleasantly surprised with Qwen3.5 122b (heretic mxfp4). I used next before, but I found it wasn't as consistent as I would have liked. I'm finding the later model pretty consistently sorts out complex coding tasks.
But maybe that's just my workflows. Primarily python recently.
orzechod@reddit
model: unsloth/Qwen3.6-35B-A3B-Q6_K
framework: llama.cpp + llama-server Vulkan
llama.cpp config:
inference settings: defaults
hardware: Ryzen HX 370 with 96GB of RAM (minisforum x1 pro)
coding agent: late
notes:
I don't have the memory bandwidth for a dense model like Qwen3.6-27B but 35B-A3B has been working out really well for me. I get a touch of overthinking but the model seems to work itself out eventually, and my inference speed (~22 tok/s using Vulkan) is fast enough that I don't really mind. using an orchestrator like Late seemed to be the missing piece for my stack though. prior to it I was using goose as my agent orchestrator and was having a ton of issues with tool calls silently failing. however Late is separating the architect model from worker models is doing wonders for me; I am able to make good progress on a variety of Javascript/Typescript, Python, and Lisp projects with a minimum of fuss now.
baliord@reddit
I use several models for different things; I run almost exclusively llama.cpp, and I use llama-swap to sit in front of my llama-server instances, providing around 32 different model choices. (I test different models regularly.) My go-to has been MoE models since \~GLM-4.6, as I can split them between GPU and CPU, and they handle it much better than dense models.
Right now I'm using GLM-5.1 (unsloth/GLM-5.1-GGUF)at 3 bit quantization, generally in non-reasoning mode, for creative writing. It's also my go-to for anything where I want to talk, but not do tool calls. At 10 t/s generation local, it's just way too slow at that, but it's human-speed for conversation or character-driven stories. It also picks up on character-definitions better than any other model short of Opus. (I've also used GLM-5.1 in the cloud for OpenClaw historically, because it's Opus-level smart, and because I want my agent to adopt the persona that is defined for it. These days I'm trying to use Qwen-122B local more consistently for OpenClaw, unless I need the smarts.)
For agentic use, Qwen 3.5-122B works surprisingly well, although not much of a 'personality'. I've run it at q4 (fully in GPU RAM) at \~50 t/s generation. I haven't needed to push up to q8 for it, and if I need much smarter I go cloud. Now the specific model I'm using there is HauhauCS/Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive. I also use that for image analysis, tagging and processing using mmproj-f16. The q4 isn't as good at image analysis as the q8, though. If I had to pick a model to stick in pure GPU RAM semi-permanently, it'd probably be this one, although I'd bump up to Q8 and let some of it sit in RAM.
I have an embedding model, but I don't really use it that much anymore. I was using Qwen3-Embedding-8B-Q8_0.gguf and a smaller Qwen reranker. I need to get back to this.
My system is a ASUS ESC4000A-E12 with a 32 core EPYC and 384GB of DDR5 RAM, 2xL40S for 96GB of GPU RAM; it sits in a SysRacks rack in my garage.
My basic config for each llama.cpp llama-server call in the llama-swap config expands to:
For non-reasoning, I add:
I customize the
-c {context-length}per model, because if you don't manually set a context length,--fit onwill shrink your context to nothing, in order to fit the models, before it goes to RAM. :rage:I also have a 'limited reasoning' for use cases where I want it to do reasoning, but I want it to not waste all it's time doing it. So I'll limit it to \~2048 tokens of reasoning, and leaving
enable_thinkingalone. (E.g. using the above 50 t/s on Qwen3.5-122B@Q4, that's about 41 seconds of reasoning.)Hope that helps!
Viperus@reddit
I tried spinning Qwen 3.5 122B with Q6 and 64k context on a Jetson Thor 128GB but this doesn't work well without some kind of orchestration, it seems.
When you say "For agentic use, Qwen 3.5-122B", how to do orchestrate agents?
baliord@reddit
Not sure what you mean by 'some kind of orchestration'? I'm running llama.cpp which offers an http endpoint, which I then configure in various other tools (Silly Tavern, and OpenClaw are the two main ones I use).
The thing that makes Qwen3.5-122B good for agent use is its tool-calling smarts. You do need an agent framework to use it, of course. I like OpenClaw, and use it extensively, but I've heard good things about Hermes, and others.
Really the question is what do you want to do with it? What need do you have that you'd like addressed?
Viperus@reddit
I ran a test just this weekend to convert the frontend from angularJS to blazor server, to do it one view at a time, test etc. but it failed horribly, mostly due to low context and too much compacting.
So, I'm trying to figure if I can split the work with subagents etc. Can it even be done with 64k context, or use Q8 or even Q6 context with 128k context, or should I use a smaller model etc.
Nindaleth@reddit
Would this option be of any help to you?
baliord@reddit
Interesting! I hadn't seen that, for some reason. Yes, that will help a lot!
danigoncalves@reddit
That's a beast what you have there :D
baliord@reddit
It is! My wife calls my ML server my mid-life crisis, 'cause it cost about as much as a car, but at least I'm not out there trying to race it. And it sounds like the blower on a race car when doing training or inference. One person described it as 'an HVAC system with opinions'.
More seriously, I leaned heavily into ML about two years ago, and wanted to have a system that could handle really strong local models for dev and exploration. Coincidentally I had a company that was happy to give me resources to lean into 'AI', and a local bespoke system builder that I had a good relationship with.
danigoncalves@reddit
LOL
rm-rf-rm@reddit (OP)
Excellent post! Would you mind sharing your llama-swap YAML? There's so little references out there and would love to compare and improve mine
baliord@reddit
Sure; it's still crufty, because I comment out stuff that's not currently being used instead of pruning and deleting, and things like that, but it might be valuable for other reasons.
I recently tried to set it up so that I could easily switch between testing ik_llama.cpp versus llama.cpp, but because the parameters are not entirely compatible, I had a lot of weird tricks I had to do to make it mostly work. And then I decided not to use it. 🤣
I'm also currently trying out `
-c 0` instead of manually setting context length per-entry, because it's annoying me to have to look up the context lengths for each model.Anyway, I tossed it up on a gist, with minimal editing. Let me know what you think, and it's okay to say, 'Oh god, don't do it like that...do it like this, instead!' :)
No-Statement-0001@reddit
Thanks for sharing that. I love getting a peek at other peoples configs. llama-swap is like a box of legos and seeing what people craft with the parts is really fun.
rm-rf-rm@reddit (OP)
thanks! it made me realize I should use macros more. Right now im just using it as a simple 1 to 1 look up table rather than groups of params
El_90@reddit
Strix halo, 128GB (I can squeeze in 92GB models currently, so rated **XL**)
Roocode in architect mode - Qwen3.5-122B-A10B-Q5_K_M (91GB), in the region of 7t/s
Roocode in coding mode - Qwen3.5-27B-Q5_K_M (20GB), in the region of 12t/s
Sorry I don't have deep testing, but I tried 5-10 other models and there was always lots of back and forth with more changes, errors, mistakes, but with these models I don't feel that, so I just stuck with them
I find 122B slightly better in architect mode, more diagrams, more thorough talking through the requirement, though maybe that's my own bias.
CSEliot@reddit
Why not Qwen 3 Coder Next?
Pretend_Engineer5951@reddit
Do you think it handles well? I tried several times and became very disappointed by results. The newer Qwen 3.6-35B-A3B performs better and faster
CSEliot@reddit
Mixed results but nothing has blown me away yet
Hobbster@reddit
I don't know Roocode but 7T/s sounds awfully slow on a Strix Halo. I usually run an Unsloth Qwen 3.5 122B A10B Q6_K around 22.5T/s and around 18T/s with 100k context, Bartowski 122B A10B Q6_K_L @ 21.6T/s and 17.3 with 100k context. Llama-server. Both models are somewhat larger, so your Q5 should be faster than that? A lot of performance ready to be discovered in your machine.
CSEliot@reddit
There's Strix Halo - Mobile and Strix Halo - Desktop. Depends on the machine. u/El_90 is probably running a Asus ROG Z13
PANIC_EXCEPTION@reddit
L: Qwen-Coder-Next is the only somewhat-consistent model I've tested so far that runs on M1 Max at 4bit MLX that can work on large codebases.
PANIC_EXCEPTION@reddit
S: For FIM, still nothing surpasses Qwen2.5-Coder for lightweight, fast autocomplete. Unless Gemma 4 FIM is fixed and it stops trying to generate new lines only instead of completing the current line.
dondiegorivera@reddit
L:
Agentic tasks:
I run Unsloth's MiniMax-M2.7-UD-Q5_K_S on a dedicated headless Ubuntu node with Ryzen9 5950, 2xRTX3090, 128GB DDR4 with the following params:
CUDA_VISIBLE_DEVICES=0,1 ./llama-server \-m /data/models/MiniMax-M2.7-UD-Q5_K_S/MiniMax-M2.7-UD-Q5_K_S-00001-of-00005.gguf \--host0.0.0.0\--port 8080 \--alias MiniMax-M2.7-Q5KS-64k \--ctx-size 65536 \--parallel 1 \--n-gpu-layers auto \--n-cpu-moe 54 \--split-mode layer \--tensor-split 5,1 \--cache-type-k q4_0 \--cache-type-v q4_0 \--batch-size 4096 \--ubatch-size 2048 \--threads 16 \--threads-batch 16 \--flash-attn auto \--reasoning off \--reasoning-format none \--reasoning-budget 0 \--mlockPrefill / prompt speed: about 11.2 tok/s
Decode speed: about 8.5 tok/s
The GPUs are power limited to 280W.
youcloudsofdoom@reddit
S - 8GB
I'm getting great mileage out of Qwen3.5-35B-A3B-UD-Q4_K_L. With this I'm squeezing around 600 p/p and 30 t/s out of my RTX 4070 Laptop (!) edition, 8GB VRAM. Very usable, and the competency on single coding tasks has been very good so far. I'm currently experimenting with using this in a local Hermes set up, but it's early days yet.
Here's my llama.cpp settings, lots of back and forth on these...
VicFic18@reddit
I'm assuming you're using CPU offloading? that too how large is your context?
youcloudsofdoom@reddit
198k, as you can see in the prams. And yes, offloading, but less of a penalty on an MoE model than on a dense one (best I can get out of the 27B is 9 t/s generation, for example)
stopbanni@reddit
In M size Qwen3.5 9B gives good results, even with such things as browser use with Hermes Agent
dinerburgeryum@reddit
Qwen3.5-27B still going strong. I wanted to love Gemma4 but once you’ve gone hybrid attention it’s hard to go back to paying the full-fat Attention cost (even with iSWA). Still haven’t really put the Gemma4 MoE through the wringer, but midsized MoEs lost a lot of trust with me from my experience with Qwen3.5. (Probably time to revisit now that all the parsing PRs for llama.cpp have been merged huh?)
zanar97862@reddit
Even with the updates I'm having iffy results from qwen 3.5 35b, lots of thinking in loops and poor tool usage. I got an opus 4.6 fine-tune from hugginface that makes it much more usable.
boutell@reddit
Oh that sounds cool... was it this one? I tried a quant of it just now but the responses I got were pretty much entirely irrelevant to the question:
https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/discussions/1
guiopen@reddit
There are still some problems with tool calls leaking in reasoning
truthputer@reddit
== Coding Only! - I only use LLMs for coding. Home workstation (built well before the RAM-pocalypse):
== Current main coding LLM: Gemma 4. It runs at high speed with a big context window - Benchmarks:
I launch with:
This hasn't been out long, but I've already noticed it sometimes makes mistakes and isn't as precise as frontier models like Claude. But it's fast and reasonably capable, just verify the work it does.
== I'm not using (until the bugs are fixed): Qwen 3.5 35B-A3B, I discovered the following problems:
== I am experimenting with: MiniMax 2.7 226B-A10B - Benchmarks:
I launch with:
This model is 140GB and obviously overflows the GPU, so CPU is at 100%, but it powers through. Probably not worth the electricity cost, but prompt caching works so once you start a conversation it doesn't feel that slow. I had it convert a Python app to Flutter and it mostly worked - although I had to ask Claude to fix some bugs it was having difficulty with.
sk1kn1ght@reddit
I have the exact same config except for 4090 and one of my ram sticks died. So I am stuck at "7" channels which is the bloody worst
truthputer@reddit
My condolences, that must be the worst trying to find a matching replacement stick in this retail environment.
sk1kn1ght@reddit
Thank you. I gave up on it. Maybe my grandchildren will be able to find one from an old bunker somewhere
SirBardBarston@reddit
Fairly new still. Is Vulkan well supported already? Your stats look great. How much is coming from the 7900XTX, how much from the 512GB RAM?
truthputer@reddit
YMMV, but from reading various forums Vulkan Compute support is maturing and in some situations can have performance on par with ROCm but Vulkan has lower memory usage. Anecdotally maturity generally seems to be: CUDA > Vulkan > ROCm. If it was my choice I would argue for the developers dropping ROCm entirely in favor of pooling efforts to improve Vulkan support.
CPU/GPU usage spit varies depending on the model and stage of the LLM processing, initial prompt ingestion usually loads the GPU full, then as processing gets underway the bottleneck shifts to the CPU. With Gemma 4 it's 2% CPU / 80-100% GPU; with big models that offload many layers to the CPU (like MiniMax) the CPU becomes the bottleneck I've seen 100% CPU and 10% GPU usage.
Also I didn't bother testing with a high performance power profile, this is in Power Saver CPU mode.
Objective-Stranger99@reddit
GLM 4.7 Flash REAP 23B A3B UD Q5 K XS
the_auti@reddit
GLM 5.1 FP8 on sGLang 4xB300 Cluster 200k Context 32k Output
Have not finetuned this yet as this is an experiment and costs $30/hour to run.
Can run a dozen parallel agents at extreme speed.
Project: Convert 500k line Node / Express / Pug codebase to Go and React.
14 Hours Run Time
170k lines of Go 35k lines of TS
Output on par with Opus 4.6
Will run e2e testing tomorrow but the initial code review (can you call it that?...code glance) using Opus 4.6 is extremely positive.
Please note this setup is not for the faint of heart. It cost close to $500 just to get this running on runpod. It is still not running "Properly" but it was enough for our experiment.
In the coming weeks we will be testing MiniMax as well.
Travnewmatic@reddit
my daily for Hermes, agent-zero, and OpenCode:
also running an embeddings model (on the CPU) for agent-zero:
been working fairly well. open to suggestions!
running a single Nvidia A10, 32G system memory. 9800X3D.
looking forward to getting my second A10 later this year :)
JaconSass@reddit
Using Gemma4:26b on a RTX 3090 for my AI Home Assistant. I occasionally switch back to Qwen3.5 to keep a perspective.
false79@reddit
gemma-4-26B-A4B-it-UD-Q4_K_XL on 7900 XTX (24GB) on llama.cpp
--temp 1.0 \^
--top-p 0.95 \^
--top-k 64 \^
--ctx-size 128000 \^
--chat-template-kwargs "{\"enable_thinking\":false}"
--jinja \^
--verbose
Pros:
- Dunno if it's the best but I find the quallity is higher than gpt-oss-20b.
- Gained Multi-modal
Cons:
- Output quality is higher on this MoE but it is slower than other MoE's I've tested.
-Once in a while, it crashes maybe twice a week?
Ariquitaun@reddit
It crashes on me at least once a day. More if I hook it up to a coding agent
No-Manufacturer-3315@reddit
Glad it’s not me, people say I am crazy Gemma is crashing, running latest models and llama cpp and runtimes
Borkato@reddit
Gemma has a TON of problems a lot of the time. It’s incredibly finicky even after all the updates and template changes
Independent_Solid151@reddit
When did you last download the model and what's your llama.cpp version.
Chupa-Skrull@reddit
It's crashing more for me after the latest round of big updates than it ever did before tbh
Independent_Solid151@reddit
Reduce the number of checkpoints and parallel instances. You can also increase batch size to 2048, while keeping ub at 1024.
aldegr@reddit
Try -cram 4096, or lower like 0. It might crash because the checkpoints are consuming too much system RAM.
false79@reddit
Thx, I've added --cache-ram 4096 to my launcher. I'll try it out and see if that makes any changes.
I find it crashes more often if the context is like 75% or higher filled.
Eyelbee@reddit
Why are you running it on that setup when qwen 3.5 27B exists? Would be significantly higher quality.
false79@reddit
Had tool issue with cline --tui, I quit on qwen.
Gemma 4 worked out of the box with no issues.
Witty_Mycologist_995@reddit
Yes.
valtor2@reddit
Is it time for the May edition?
venice_mcgangbang@reddit
yes
dco44@reddit
Agentic / Tool Use
Qwen3.5-14B for anything needing reliable tool calling and structured output — instruction following at this size is surprisingly good. Step up to Qwen3.5-32B for harder multi-step reasoning where the 14B hallucinates paths.
Minimax-M2.7 has been the surprise — genuinely competitive with cloud Sonnet for conversational tasks, fits in 24GB at 4-bit. The "Sonnet at home" framing holds up.
Raw coding: Qwen3.5-27B, nothing local has displaced it for me.
superchorro@reddit
What hardware are you running on? I'm very new to this, but I want to do text search, extraction, and analysis on thousands of documents using my desktop, which has an 4070 ti super. Would the models you just mentioned run on this?
dco44@reddit
I’m running on M5 Max 48 Gb but coding agents works with 20 Gb also. You can figure that out. Use links from my public repo and your Claude or other AI agent will install it. https://github.com/dcostenco/prism-coder
simotune@reddit
The “best” model is usually the one you can actually run comfortably. Latency and memory pressure kill a lot of leaderboard wins in practice.
brenden77@reddit
I've got a 3090, and cannot for the life of me get anything to work. I'm not sure what I'm doing wrong or missing.
I can't make a post yet, because i'm new around here. I'd appreciate if someone with a single 3090 would come along and help me with how to set things up.
Mainly, I want to use it for agentic coding. Some of the comments in here clearly show that it's working for others. I just don't understand what it is i'm missing.
If you could share your 3090 setups, models you're using, and configurations I would truly appreciate it!
Thanks folks!
overand@reddit
I'm happy to help! I have a dual 3090 setup, but was on a single 3090 for a while.
brenden77@reddit
Hey, thanks!
overand@reddit
Well, you'll have better luck with "fine tuning" in llama.cpp rather than ollama. But, you'll either need to have docker working w/NVidia / CUDA, or you'll need to compile llama.cpp to make a version you can run. (It's not all that complex, but it can be daunting if you've never done it before, or don't have much CLI experience)
Anyway, get / install "nvtop" and run it to take a look to see if you have things already running and using GPU memory - that'll be a big factor in possible issues.
Once you're clear there: if you want to try this with ollama, my suggestion is:
ollama run hf.co/unsloth/Qwen3.6-27B-GGUF:Q4_K_M (or just "ollama pull ...") - though the llama.cpp way would be:
llama-cli -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M
Can you go into more detail of what you mean by "it's not working"
Jazzlike_Arm6363@reddit
hello, i have hp omen max , AMD Ryzen™ AI 9 HX 375 (up to 5.1 GHz max boost clock, 24 MB L3 cache, 12 cores, 24 threads)
AMD Ryzen™ AI (55 NPU TOPS)
GeForce RTX™ 5080 Laptop GPU (16 GB GDDR7 dedicated) 175 watt tgp
32 GB ram DDR5-5600 MT/s
can i run comfyui ? and which local llm is the best for my laptop ?
i,ll use it for everything even creative and rpg but mostly for work and study and learning and searching...etc you name it
hmmmmm_nl1@reddit
Comfyui is just a program, a scaffold to run models, so yes you can run that. Then it depends what model you are trying to run. You can download models straight from the software, including preset configs, watch a comfyui tutorial on youtube and you should be good to go.
As for general use, download lmstudio and use the model search option, that will give you every model on the planet (almost), and show you if it will fit on your gpu memory. See screenshot, on the left side of the download button you see 'full gpu offload possible'. Best way to start i recon, good luck!
Jazzlike_Arm6363@reddit
i downloaded lm studio and its easy to deal with , downladed qwen latest release 3.6 and it doesnt generate images , is there any model with high tokens that can generate images and uncensored ( not nsfw related )
TacticalGhosting@reddit
Looking to get into local models. already set up LM studio and connected it to Anything LLM.~~~~
Im looking for models that can run on my 8gb rtx 3070, 32gb ddr4, 5600x pc.
I'm looking for specialist models now. One dedicated to coding.
Then one dedicated to general intelligence.
One for creative storytelling.
All of them need.to be able to use tools. And hopefully all the can be almost or entirely inside the 8gb vram...
Especially the non coding ones. And hopefully can be used from ALLM as well.
hmmmmm_nl1@reddit
Just wanted to share my latest inference result for project 'Moriarty' using cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit on a watercooled RTX4090. Getting up to 850 tokens/sec with parallel agents using langgraph->vllm.
The app is build as a general research agent, taking 1 subject like: 'research the history of the Audi RS division' and then running with it for a set amount of time. It will generate sub-topics, and fan out up to 16 langgraph agents to a remote pc for the inference. These will also rate content quality and loop back to the planner. Starting with cheap search options like ddg en searxng, and upgrade to exa deep research to fill in gaps untill all subject are covered.
Locally its all running on a macbook m2, using a bunch of tools to scrape sites, frame analyze youtube videos using vsmol, read pdf files, whisper video and audio to text, translate, run embedding with ollama, save to vector db (lancedb) for semantic search etc. No real usecase yet, just having some vibe lols. The inference is done on a remote pc (windows 11+wsl2/ubuntu).
I've tried multiple models on the PC side (RTX4090), mainly Qwen 3.6 and Gemma 4 lately, both the dense models and MOE variants. In the end both were more then adequate, but MOE gave better speed, and Gemma was 10% faster then Qwen with similar results. Tried fitting different quants, some Q5 models fit as well, but gave no better output quality, and Q4 gave me more cache room, now running max 16 agents with 32K context (depending on amount of sub-topics agents are created). Some results from the logs, ran some random 5 minute benchmarks during research runs:
Benchmark Run #4 Analysis, running 15 agents:
Metric Minimum Maximum Average
Aggregate Speed (Total TPS) 628.0 843.4 756.0 tokens/s
Per-Agent Speed (TPS/Agent) 46.2 64.6 54.0 tokens/s
GPU KV Cache Usage 50.0% 70.8% 58.5%
Max Concurrency — 15 Agents —
Benchmark Run #5 Analysis
'In this run, we saw a lower concurrent load (8 agents instead of 15), which gave us a great look at how the server behaves when it's not fully saturated.'
Metric Minimum Maximum Average
Aggregate Speed (Total TPS) 257.9 659.2 414.2 tokens/s
Per-Agent Speed (TPS/Agent) 72.6 181.6 90.8 tokens/s
GPU KV Cache Usage 6.3% 32.5% 18.7%
Max Concurrency — 8 Agents —
'Observations: Lower Load, Higher Speed: With only 8 agents running, each individual agent received text nearly 2x faster (90.8 TPS avg) than during the 15-agent peak.'
Prompt Processing (Prefill) Performance
Metric Value
Peak Prompt Speed 5,335.4 tokens/s
Average Prompt Speed 1,500 - 2,400 tokens/s
Latency per Prompt. \~200ms - 500ms (for typical Moriarty instructions)
Ive been messing around with MTP aswell, and some turbo and rotorquant, but for my research usecase this multi agent setup (heavily relying on prefix cache) is giving me the best results. Maybe in the future i can add it on top, that might reach 1000TPS? :)
Sidenote on the dflash-mlx model:
Im only using this at the last step, taking in massive context and forming a documentary script of sorts, using all articles and files to create 1 coherent story. Wanted to play around with dflash, but soon learned that would not work well on my system, ended up with the regular gemma-4-26B-A4B-it. This keeps both system on the same model, which makes things easier in the future. On an M2 Max chip this is running @ 100tps.
Series-Curious@reddit
OCR
archieve_@reddit
PaddleOCR
Disonantemus@reddit
Pure OCR Models supported by llama.cpp (preference order)
General models with vision doing OCR (S range: 5GB VRAM):
My tests prompts were:
Note: Spanish language scanned documents
rm-rf-rm@reddit (OP)
Creative Writing/RP
elongated_argonian@reddit
M – Rocinante X 12B. Honestly the best fine-tune of Mistral Nemo I've seen so far for RP. Generates more natural prose than even Gemma 27B, in my experience.
evildark_08@reddit
I discovered this one thanks to you and I had a lot of fun trying it out! Now I have 15k context length so I'll need to learn a bit about SillyTavern how I can summarize the story haha
elongated_argonian@reddit
No problem, happy I could help! If you have the specs for it, Drummer recently made a 16B version, Rocinante XL, which is even better in my experience.
evildark_08@reddit
I have 12gbs of vram.. I'm not sure but I'll try anyway 😃
elongated_argonian@reddit
It's certainly worth a shot! It would be a tight squeeze at Q4_K_M. If you need help setting it up, feel free to give me a holler. I love talking about this stuff.
DragonfruitIll660@reddit
Gemma 4 31B. Even smaller quants feel like a large step up in terms of quality. Doesn't hurt the speed is quick as well on weak hardware. Bit of a positivity bias and the vision seems to struggle a bit relative to writing quality but its good overall (it struggles to read maps a little and doesn't always accurately describe assets/story items.)
a_beautiful_rhind@reddit
Very good for a 31b and thankfully not so censored. Plus it has tuning potential.
Sadly the base is a bit dumber and it seems like context memory use is on the heavier side. Still, it's great to have a fun model for once.
ActivelyCoping@reddit
I’m new here so forgive me for the dumb question but I didn’t know local llms were censored, I thought that the whole point of running an llm yourself was to avoid just that. If the code is open source can you just “jailbreak” it, or would that be prohibitively hard somehow.
a_beautiful_rhind@reddit
The censorship isn't related to the code but how the model is trained. You can jailbreak it or sample around it, but the refusals are still in there. On API you get what you get.
There's also some finetuned or less censored models, abliterations, etc. Make no mistake though, you are still paying for "alignment tax".
ActivelyCoping@reddit
So essentially if too much media on the internet tell the model that some idea is dangerous, it will just regurgitate that to you?
a_beautiful_rhind@reddit
They use safety datasets on local models too. It's not even random internet data but intentional.
ActivelyCoping@reddit
Thats a shame, please tell me they dont have telemetry too…
a_beautiful_rhind@reddit
Look over the software you use because sometimes... But that's not the model itself.
ninjasaid13@reddit
it's soo censored, which is a bad quality for a creative writing / RP model to have.
AdamFields@reddit
Honestly same, Gemma 4 31B made me retire Skyfall 31b, Valkyrie 49b and Anubis 70b basically overnight. I was actively using all three for creative/RP up until I loaded Gemma 4 31b for the first time and just... never went back. The context coherence is what gets me most, it actually remembers what's been established and builds on it properly instead of treating every few exchanges like a soft reset. It referenced a setup/foreshadowing from like 50k tokens earlier that I had completely forgotten about myself and used it to strengthen a scene in a way that felt very intentional and earned. I have never seen a model at this size do that, only frontier models have been able to pull something like this off in my experience.
Really curious what TheDrummer ends up doing with it, as the only area where something like Valkyrie still has a slight edge is pure prose style, that richness and expressiveness in the writing. If TheDrummer can bring that flowery style to a model that already has this level of intelligence and creativity underneath it, that combination would be genuinely unmatched.
Stepfunction@reddit
It also is very good about following System Prompts.
Caffdy@reddit
how do you setup system prompts with gemma4?
LetsGoBrandon4256@reddit
Nothing Gemma4 specific from my own testing. Just shove your writing guide line in the system prompt (1st message for
systemrole) and it should just works.CrowdThumper@reddit
What sort of creative writing are you using these models for i am curious.
AdamFields@reddit
Mostly sci-fi horror and dark fantasy.
bswillie@reddit
Q4 is blazing fast with generous context on a 5090. Base is fine but having a blast trying every revision of u/TheLocalDrummer's Artemis the moment they drop. Not deleting Skyfall yet... But not sure if I'll ever load it back up either. And I loved that thing.
Spitfire75@reddit
is Artemus stable enough to use?
huyanb999@reddit
Great compilation! Qwen models have been impressive lately. The open-source community is really pushing the boundaries of what is possible with local inference.
Weak-Shelter-1698@reddit
Peak experience for me.
MasterKoolT@reddit
Especially good with thinking enabled when you tell it explicitly to check for consistency
Randomdotmath@reddit
This doesn't help much, complex characters still need a SOTA model. That said, the writing is genuinely excellent in this weight.
Genebra_Checklist@reddit
Have you tried to store the details on a database and injet it with the prompt? Im doing something simmilar but with factual information
Far-Low-4705@reddit
it is good for writing human sounding emails and not AI slop sounding stuff.
Also just way nicer to read and chat to. unfortunately a little bit sycophantic for me, and my primary use case is engineering/coding, so qwen 3.5 is the default for me
Genebra_Checklist@reddit
Gemma 4 26B A4B. I was working on a Gemma 3 fine tunning when Gemma 4 launched. Man, we can't even compare other models for creative writing. Works wonders in pipeline with few shots style exemples.
Caffdy@reddit
can you explain this part? how do you setup the pipleline?
Traditional_Chart970@reddit
I want to try it with my new Macbook Pro 64GB ram. I can't try it my MBA M3 16GB :(
Top-Rub-4670@reddit
You can run Gemma 4 26B A4B Q4 just fine on that. It won't fit entirely in RAM but as a MoE they suffer a lot less than dense models when that happens.
I've personally run 26B on a machine with no GPU and only 16GB total RAM and still got 3tg/s, enough to play around with it.
I used llama.cpp with
--mmap --no-repackyarikfanarik@reddit
how did you do it at Q4, my laptop freezes trying to boot it, running iq3_xxs is fine
Badger-Purple@reddit
no that wont work on a mac. macs dont have system ram, its unified. so the 16gb is all there is, no experts loaded to cpu or whatever. op is right. he will have to wait till get gets to his 64gb machine!
Endri-Decuir@reddit
The low refusal rate is a big deal for RP specifically. A lot of models in that size range will break immersion constantly with unnecessary hedging. Good to know Gemma 4 31B holds up there.
IrisColt@reddit
Exactly. You can give Gemma 4 an entire piece of work and ask it to jump in at any point in the story to continue the narrative, enabling an alternate but plausible back-and-forth interaction with the characters... and it works remarkably well. Although far from perfect, no other model in its parameter range matches its capabilities.
Tashimm@reddit
**Coding/Programming (S: \<8GB VRAM)**
For lightweight coding assistance on low-end hardware, I recommend Qwen2.5-Coder-7B-Instruct Q4_K_M (\~4GB VRAM). It's excellent for code completion, debugging, and explanations in languages like Python, JS, C++. Benchmarks show it outperforming similarly sized models in HumanEval (pass@1 \~65%) and MultiPL-E.
Example usage with Ollama: `ollama run qwen2.5-coder:7b-instruct-q4_K_M`
It handles context well up to 128k tokens, great for reviewing medium scripts. Pair with DeepSeek-Coder-V2-Lite-Instruct 16B Q3 (M category) if you have \~10GB for tougher tasks.
Sources: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct, official Ollama library benchmarks.**Coding/Programming (S: \<8GB VRAM)**For lightweight coding assistance on low-end hardware, I recommend Qwen2.5-Coder-7B-Instruct Q4_K_M (\~4GB VRAM). It's excellent for code completion, debugging, and explanations in languages like Python, JS, C++. Benchmarks show it outperforming similarly sized models in HumanEval (pass@1 \~65%) and MultiPL-E.Example usage with Ollama: `ollama run qwen2.5-coder:7b-instruct-q4_K_M`It handles context well up to 128k tokens, great for reviewing medium scripts. Pair with DeepSeek-Coder-V2-Lite-Instruct 16B Q3 (M category) if you have \~10GB for tougher tasks.Sources: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct, official Ollama library benchmarks.**Coding/Programming (S: \<8GB VRAM)**
For lightweight coding assistance on low-end hardware, I recommend Qwen2.5-Coder-7B-Instruct Q4_K_M (\~4GB VRAM). It's excellent for code completion, debugging, and explanations in languages like Python, JS, C++. Benchmarks show it outperforming similarly sized models in HumanEval (pass@1 \~65%) and MultiPL-E.
Example usage with Ollama: `ollama run qwen2.5-coder:7b-instruct-q4_K_M`
It handles context well up to 128k tokens, great for reviewing medium scripts. Pair with DeepSeek-Coder-V2-Lite-Instruct 16B Q3 (M category) if you have \~10GB for tougher tasks.
Sources: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct, official Ollama library benchmarks.
IrisColt@reddit
Qwen 3.5 27B, heh.
I'd make a case for it. I wouldn’t trust it with established lore or especially nuanced prose as a foundation, but with thinking turned off, it's about as fast as Gemma 4 in the same mode, and its grasp of a story in motion is just as good... arguably better. It's a strong choice for quickly sketching ideas or stress-testing worldbuilding, and it has a sharper, more playful intelligence than the blunt simplicity of Gemma 3, even if it still lacks much of Gemma 4's uncanny human intuition.
Ok-Internal9317@reddit
for creative writing it seem to include artifacts from overthinking and also content get boring in my opinion potentially also, due to overthinking, its superb in physics and path tho but in my opinion the gemma can take it
IrisColt@reddit
I agree with you. I was grasping at straws trying to say something good about Qwen 3.5 27B and creative writing... Gemma 4 is such a beast...
HopePupal@reddit
my recommendation for the XL category remains MiniMax: M2.7 seems to be about as good as M2.5 (i'm running UD-Q3_K_S). doesn't need to be abliterated/uncensored/whatever. fewer LLM-isms than any of the Qwens i've tested, although far from zero: it loves vanilla and ozone as much as any other model.
dondiegorivera@reddit
According to Benjamin Marie Minimax degrades a lot for more agressive qwants.
I want to use them for agentic work on my headless server dedicated for inference (Ryzen 9 9590, 2x3090, 128gb DDR4), but I still don't know which qwants shall I choose and what would be the best strategy with llamacpp: keep the experts in RAM and put the attention layers and context to the GPU? I am aiming for 64k context and if possible 2 parallel execution.
HopePupal@reddit
that is a different size category, different hardware, and different application than what i'm using it for, so i have no idea
DeepOrangeSky@reddit
Are you saying Minimax is relatively uncensored? I haven't ever tried Minimax before, but always assumed it was very censored and probably not so good at writing, since I always thought of it as a coding model. Did it used to be more that way for M2.1 and M2.5 and now it is more loosened up for M2.7? Or was it already fairly uncensored, etc, even for M2.1 and M2.5?
Also, LLM-isms aside, how would you say it compares to Qwen3 235b a22b instruct 2507 (and maybe also Step3.5 Flash 197b) for creative writing ability, not in terms of the prose, necessarily, but more so in terms of understanding themes, interpersonal dynamics, human nuances in difficult/awkward situations, etc (things that require it to be smart/deep, not just good at writing pretty looking sentences, that is)? Are those ones as good/better than it, at that, or is Minimax better than those? I have slow internet and a harsh data cap, so I am trying to be selective about which big models I download in a given month. Also, I'll probably be using them at the bottom of Q3 (xxs) or top of q2, since I'll be using them on a mac with 128gb unified memory.
So far Mistral Large 123b/Behemoth is the strongest local LLM I've used for creative writing on my mac, by a decent margin, but I haven't tried to 200b+ MoEs yet, so, I am curious if some of those at the low quants will be able to dethrone it, or even come close, or not (or maybe in some aspects or something). I assume they won't, and that you need to go up to like DeepSeek/Kimi level to beat Behemoth, but, it should be fun to try some new models out.
Kamal965@reddit
I purchased a $10 plan from Minimax that comes with 1500 requests per day, no token limit. Used it for some creative writing and RP. I have to say that I was VERY impressed. Its writing style is actually very unique compared to everything else I've used. M2.5 seems less censored than 2.7, but both work uncensored with an appropriate system prompt. Every now and then I get a refusal, but if I regenerate once or twice it complies. The model is genuinely a breath of fresh air in terms of writing style.
HopePupal@reddit
for reference, i'm using a 128 GB Strix Halo. any 128 GB unified memory Mac has better memory bandwidth than my Strix and possibly better compute.
i've tested M2.0, M2.1, M2.5, and M2.7 at this point, and they'll all write pretty much whatever, given an appropriate system prompt. the part that really impressed me wasn't the prose but behavior over long context — i wasn't expecting any model to be particularly good at callbacks or B-plots or any sense of the passage of time, or at maintaining separate character voices.
i have not tried Step or that Qwen 3, although i do have Qwen 3.5 397B-A17B quantized to within an inch of its life that i could try tomorrow (every smaller Qwen 3.5 has been pants at writing). the GLMs i can run locally aren't good for writing either (although cloud GLM 5.1 isn't bad). GPT-OSS 120B had moments of competence; i might still be experimenting with it if i hadn't discovered MiniMax was good for more than just code.
i've been generally unimpressed with Mistral models so far and gave up on them a while ago. serious tendency to just echo prompts back to me with minimal additions, even with the bigger ones running non-locally like full-size Mistral Large 3. haven't tried Behemoth, but the few "omg you have to try this" fine-tunes i've tried have been indistinguishable from their parents in that respect. always been surprised that the scene came so far on so little. (same with LLaMA models.)
DeepOrangeSky@reddit
Interesting, maybe I'll give it a try
stopbanni@reddit
In "S" size my go-to is still Gemma 3 4B. It's multilingual, it has good support in old versions, and available even in openrouter to try it. Tested on quant Q4_0
Borkato@reddit
Waiting for a heretic of skyfall 4.2 😭
-Ellary-@reddit
For what? It has 0 refuses. Heretic just will hit the overall logic making model to be a yes man.
I mean when you ask heretic to perform random cannibal action, every char gladly agrees.
Borkato@reddit
Are you sure? I’ll try it then, I just get annoyed if there’s ever even a single one. Or talking about medical topics or suicide
rm-rf-rm@reddit (OP)
GENERAL
pepediaz130@reddit
Gemma 4 E4B on Mac Mini M4 (16GB) - Benchmarks (oMLX vs Unsloth)
I've been benchmarking the Gemma 4 E4B models on a Mac Mini M4 (16GB) to find the optimal configuration for coding and technical assistance. The following results compare the standard oMLX quants against the Unsloth UD-MLX (Dynamic) versions using the oMLX engine with Paged SSD KV caching.
Performance Comparison (Generation TPS):
Technical Observations:
The oMLX Standard 4-bit is the most efficient choice for a daily driver. It maintains over 20 TPS at 32k context with a minimal memory footprint (\~4.5GB), allowing the system to handle other heavy processes without lag.
The Unsloth UD-MLX 4-bit offers better logical reasoning and native vision support, though it carries a 20% performance penalty. It is the preferred model for vision-centric tasks or complex debugging where precision is prioritized over speed.
Regarding the 8-bit versions (both oMLX and Unsloth), they perform nearly identically. However, on 16GB hardware, they hit a hard limit at high context. As soon as oMLX begins aggressive SSD paging at 32k, speed drops to \~9 TPS, making 4-bit the only practical option for long-context workflows on this machine.
In summary: Use oMLX 4-bit Standard for speed and general coding; switch to Unsloth 4-bit UD for vision and high-level reasoning.
Trick-Assignment-828@reddit
How are you runing unsloth?
MerePotato@reddit
I highly recommend against 4 bit quants for a model this small and knowledge dense
gandhi_theft@reddit
What t and top_p do you like with this model for coding?
__ahdw@reddit
This is very impressive! I am using TheTom/TurboQuantPlus llama.cpp, Gemma-4-E4B-IT-Q4_K_M.gguf, even 48K context fits in 16GB RAM of my M1 Pro MBP, but the speed reduces from 27 tps (0 context) to 9.9 tps (32K context)
-ctk q8_0 -ctv turbo4
illusionmist@reddit
Can I find oq version on huggingface like other models or do I always have to convert myself? And does it only work on bf16?
Thrumpwart@reddit
Unlimited - Qwen 3.5 122B 8-bit MLX is the best general purpose model for me. Tons of general knowledge, good long context reasoning, and not as slow as you’d think on a Mac.
ilikesmellytofu@reddit
I imagine this requires at least 192GB of unified memory? What's your memory usage like if you run nothing but the model with context?
Thrumpwart@reddit
With some MacOS overhead and Qwen’s very efficient kv size, at 8-bit I use ~135GB memory total.
ilikesmellytofu@reddit
Man, truly unfortunate, just past the 128GB range for M5 Max. Would likely have to downgrade to 6-bit, or significantly reduce kv size
Thrumpwart@reddit
Try an oQ6 or oQ4 quant. I’ve been using them for a bit and they are very good.
Realistic-Advice-199@reddit
Is it mlx-community/Qwen3.5-122B-A10B-mxfp8, or a different one? There weren’t too many 8B ones that I could find.
Thrumpwart@reddit
Either that one or the regular (not mxfp8) 8-bit.
rm-rf-rm@reddit (OP)
if youve used gpt-oss:120b, how does it compare?
Thrumpwart@reddit
I’ve used it in the past but deleted it awhile ago. Maybe I’ll download it again and check it out.
Total_Activity_7550@reddit
Qwen3.5 27B and Gemma 31B.
Anacra@reddit
Is Qwen3.5 27b better than Qwen3.6 27b?
Operation_Ivy@reddit
On the M3 Ultra 512 GB, nothing beats Qwen3.5 397B 8 bit quant from Unsloth. Working with structured and unstructured data, chatting, world knowledge - best generalist agent I could find. I compared to GLM 5.1 and Minimax M2.7. GLM was similar quality but much slower. Minimax was faster but lower quality.
Ok-Internal9317@reddit
yo what's the speed and PP for that?
Operation_Ivy@reddit
For the MLX 8bit on oMLX, I'm getting \~130 tok/s prefill (at 10k ctx) and \~18 tok/s decode
_hephaestus@reddit
somehow not seeing anything from in the oMLX model search for 397B mlx from unsloth, only thing I'm seeing is from a basic
unsloth 397Bquery ishttps://huggingface.co/jackzampolin/Qwen3.5-397B-A17B-unsloth-mlx-4bitDid they remove it/am I blind or is this a gguf?Operation_Ivy@reddit
https://huggingface.co/collections/mlx-community/qwen-35 no official Unsloth mlx
TXNatureTherapy@reddit
Local machine with an Intel i7 14700F, RTX 4080 Super (so 16 GB VRAM) and 32GB DDR5 Ram. If I'm looking for a model that can do image generation and editing (possibly videos, but that isn't my current focus), and that can also do consistent characters and backgrounds, what should I be looking at for a Local model?
MarkoMarjamaa@reddit
As non-english I have to mention gpt-oss-120b. It's working fairly well in Finnish. Tried Qwen3.5, produced a lot of gibberish.
Then found out about EuroEval, EU-funded test bench for european languages.
https://euroeval.com/leaderboards/
Gemma4 seems to be very good in Finnish, will try that later.
I also ran the test with gpt-oss-120b in Finnish.
https://flow-morewithless.blogspot.com/2026/04/mika-kielimalli-on-paras.html
(and yes, the blog post in in Finnish)
Top-Rub-4670@reddit
I can confirm that Gemma 4 is very good in Finnish and Swedish in my tests. It's also good at French. This is less impressive, but what is impressive is that it knows regional dialects, which is usually entirely lost in other models (you get neutral Parisian French with a hint of mimicry of the dialect you specifed, if you're lucky).
WhoRoger@reddit
S for pool people
Still loving Smollm3 3B, I think it's overall the best small model to play with
Granite 4 7B and 1B, underappreciated. 1B can code up some basic things better than Qwen 3.5 2B, and 7B is MoE so much faster than anything at that size
Qwen 3.5 4B is incredible for vision
LFM 2.5 1.2 Thinking, a tiny reasoning brain
InternVL3.5 1B, a tiny competent vision model
Gemma4 E4B shows how far have small models come
OrganicHalfwit@reddit
3060ti 8Gb with 32gb system Ram.
Been using Qwen3.5-35B-A3B-Q4 for large text chains with multiple files and comparison. Got upto a context length of 110k, but it's quality was dropping significantly. \~30t/s
For small and fast questions about the models themselves I'm using Qwen3.5-9B-Q4 \~20t/s
I am currently running on ollama with jan.ai ontop so i'm trying to move over to llama.cpp with webui.
All very new to this though and want to get into image, audio, and video gen.
salmon37@reddit
Hey, how do you share gpu ram with system ram? I'm only getting into local llms and I've la 3080 with 10gb ram and I've been able to run 2 bit quantized models with llama.cpp, but I didn't know you could use regular ram with these models
OrganicHalfwit@reddit
So from my understanding is that only MoE models can share between ram. Take the Qwen3.5-35B-A3B, its not actually a 35B parameter model, its 7 different 3B parameter models which are all honed for multiple different tasks. MoE (Mixture of experts) allows the models that aren't being used to wait in system ram while the singular 3B model which is most relevant for the task swaps in and out.
So effectively you have multiple different brains, which are all pretty smart, waiting in the sideline to sub in for one an other. This lowers total speed on call (but tokens generate still quite fast), it also means that there's alot of excess room in your vram so your context length can be much larger.
However, you are still using 3B parameter models so they can be fair stupid. On benchmarks the 9B basically always beats the MoE, but because its so large I can only use that with a small context window of 4k tokens
with 10gb of vram you have just a little bit more wiggle room than I do so you can play around a bit more. Although for huge conversations (100k +) i think the MoE's are good enough.
As to "how to use", my front end jan.ai does automatic allocation, but it's suboptimal so I have to play around a bit. Specifically enabling "keep all experts in CPU" and I have -1 on offloading model layers to GPU, which honestly I cant remember why i did that as it's kinda counter intuitive.
I'm still learning too at this point. Anyway, hope this helped!
salmon37@reddit
Helped a bunch, thanks so much for the detailed response!
Nabushika@reddit
gguf files (/llama.cpp) is designed to be able to split computation between CPU and GPU (although I've heard ik_llama might be faster). Any model you download will spill over into system ram if you don't have enough vram (with appropriate slowdown).
Objective-Stranger99@reddit
Qwen3.5 35B A3B UD IQ4 XS for deeper thinking.
Gemma 4 26B A4B UD Q4 K XL for better conversational flow.
Hydroskeletal@reddit
Research and ingestion projects is where I'm working local; for coding I've not seen enough to pull me away from Claude/Codex. But when you're dealing with a flood of data, way too easy to blow your token budget.
A couple of M4 Macs for me and I'm all in on Gemma4 right now, 31B and E4B. Qwen3.5-35b-a3b was just the hands down winner but I kept coming back to Gemma and it depends on what you want to put in.
If you hand both Q35-27b and g4-31b a book and say "Write me a detailed book report", Qwen is going to give you the better, longer report. By default Gemma is lazy. You need to tell it to not spare the thinking budget, make sure you're giving it max tokens for output and really tell it all the things you expect in the book report. Then you take the detailed prompt and give it to both and Gemma will have more details in a more concise format. Qwen will repeat the same ideas phrased differently.
Same thing goes for planning. If you tell Qwen "Give me a plan to do X", Qwen does a better job of intuiting what you want. But be specific about what the plan needs to do, the metrics, goals, outcomes, things to account for etc. and Gemma is better.
Where Gemma absolutely crushes Qwen for me though is my source discrimination tests. Qwen is eager to include crap and then hedge that it might be crap, or is perhaps more "crap curious" meaning it will look at something with a crappy abstract and then after wasting time decide it was in fact crap.
So the workflow is using the big dense Gemma as the 'brain', doing the big data work and then delegating out to a small model for very constrained tasks ("Which of these 3 documents meet criteria X?") and e4b really does quite well at this. I was using gpt-oss20b before and e4b is just strictly better. Caveat is that you really need to use thinking. I tried q3.5-9b but it was often slow enough that it didn't make sense. I should probably do more testing for q3.5 at the 4b size.
CatEatsDogs@reddit
Using three llms through my telegram bots -->n8n instance: 1. Parakeet (hope I correctly typed it) if I want to ask something in voice 2. qwen3.5:35b-a3b-q4_K_M if input was text or recognised speech 3. gemma3:27b-it-qat if input contains image
2 and 3 are running on the server in ollama using 12gb rtx3080. Parakeet is running on separate lenovo 720q fully on cpu.
Qwen is mostly used for translating something into my native language. Gemma is used the same way but with images.
I tested image processing in qwen by I didnt like it. Gemma is "smarter". I tried to post random screenshots from random youtube dron flies and gemma recognized more places successfully.
Also tried to use newest gemma4 26b but struggle to disable thinking in n8n.
Farmadupe@reddit
I've got a 3090 + 2060 at home, that's enough for models in the 30B range at \~q5 with llama.cpp
* For general agentic work, qwen3.5-27b at \~Q5 is just about on the right side of competent, with some handholding on MCPs that its given. But with a small set of tools, and the right carefully crated prompts, it can do useful stuff independently.
* For batched logging/classification, I switch to the smallest qwen3.5 possible that I can use. qwen3.5-9b and below fit entirely in the 3090 at fp8 so can run under vllm, which is way way faster and less buggy.
* for some tasks, I can switch to 122b or 397b, but they're orders of magntiude slower so don't get used much.
* qwen3.5 has SOTA-level image comprehension. there's no need to pay money for image classification tasks.
* gemma4 31b is roughly comparable to qwen3.5-27b but not quite as good. The only task that it really beats qwen3.5 is video comprehension. I can stuff 100k tokens into context and get better groundings than qwen3.5 series. The default persona is a bit more pliable than qwen3.5, which can be a bit robotic.
* honestly, at 32Gb vram, I don't think I can replace opus/codex for agentic coding yet. I make my paycheck of coding and qwen3.5-27b is too slow and not brainy enough for coding tasks.
other observations about the state of the world...
* 4 bit quants hardware-optimized quants for vllm are massively lobotomized and not worth using IMO. they're barely coherent. llama.cpp quants aren't perfect but they way more usable than you'd think (I think we all owe a massive thanks to unsloth aessedai ubergarm bartowski mradermacher & co for cramming so much quality into tiny tiny quantizations)
* Abliterated models are all totally useless and lobotomized to hell. But Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking wins Best Name Ever points (and insulting the user points too)
* While llama.cpp can be a bit wobbly at times, the love and care that goes into it really shines through. Without it, we'd all be stuck with running tiny model and lobotomized quants on vllm, adn the locallama space for consumer GPUS would be completely different. So a big thankyou to the llama.cpp devs too!
ogvisit2@reddit
this is so helpful, thanks!
MrB0janglez@reddit
Agentic/coding: running Qwen3.5-35B-A3B-Q4_K_M on a single 3090. Getting roughly 18t/s which is fast enough to stay interactive. The A3B variant is way more practical than the full 235B for daily use without a multi-GPU rig. Tool calling has been solid with llama.cpp function calling template. Tried Minimax-M2.7 briefly but can't run it locally with my current VRAM. GLM-5.1 is impressive on focused tasks but loses coherence on longer agentic chains in my experience. Qwen3.5 is my daily driver for anything coding related right now.
youcloudsofdoom@reddit
Just FYI I'm getting 35 t/s on a 4060M laptop GPU with that model and quant, so you definitely want to look at your llama.cpp settings.
MrB0janglez@reddit
went back and checked after seeing this -- had -ngl set way too low so it was only offloading a fraction of layers to the GPU. bumped it to 99 and speeds jumped significantly. classic setup mistake, thanks for the nudge.
brenden77@reddit
Mind sharing your setup? I'm trying to get a single 3090 running, but nothing i've done works well or consistently.
youcloudsofdoom@reddit
Glad to hear it, happy to help!
MrB0janglez@reddit
Good to know, that's a meaningful gap from what I was seeing. I'll dig into the llama.cpp settings, probably something off with the thread count or context size config. Appreciate the heads up.
IntelligentDog7952@reddit
this is very small, I run this model on 3090, it fits completely when using 64k context and outputs from 125-138 tokens per second.
mudkipdev@reddit
This is very strange. With a 3090 you should be looking at at least ~60 tokens per second.
sk1kn1ght@reddit
That's true. 4090 with dense I get 45tps while severely undervolted. Your settings or smth must be butchering you.
mrtrly@reddit
Qwen3.5-27B has been my daily driver for agentic coding on a single 3090. Thinking off for tool calls is the move because the reasoning tokens add latency without improving function selection. The 27B quants still punch way above their weight class for structured output.
brenden77@reddit
Any chance you'd be willing to share your setup and config? I've tried that model and cannot get it to work at all.
ghunny00910@reddit
How about now? 3.6?
CulturalSock@reddit
Has anyone used a medium model (<32GB) as a plain code executor, while using gpt/opus as planner? How does it perform?
Confident-Cry-1581@reddit
Got told off at work that they'll cut me off from Claude if I don't stop using it on the weekends.
So what could I use locally for agentic coding? I have a 9950X3D / 96GB / 5070TI 16GB
Any tips would be appreciated :)
Longjumping-Move-455@reddit
If you didn’t get an answer qwen3.6 35b using llama.cpp :). I suggest using unsloths gguf. https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
muyuu@reddit
May thread when?
rm-rf-rm@reddit (OP)
Speciality
(includes medical, legal, accounting, math etc.)
Disonantemus@reddit
OCR
HunyuanOCR
llama.cppUse case: convert documents with tables or charts to markdown/mermaid, that is hard for traditional OCR (Tesseract).
UpsetEmotion6660@reddit
Edge AI / IoT inference on constrained devices:
S (<8GB): For always-on sensor inference on MCU-class hardware (STM32N6, ESP32-S3, Syntiant NDP120), you're not running LLMs — you're running quantized TinyML models (keyword spotting, anomaly detection, vibration classification). TensorFlow Lite Micro and ST's NanoEdge AI Studio are the practical tools here.
M (8-32GB): This is where edge inference gets interesting. Running quantized 4B models on Jetson Orin Nano or RPi5 + AI HAT for real-time vision, predictive maintenance, or local NLP. Gemma4 e4b quantized is genuinely usable for on-device agentic tasks in industrial IoT — local decision-making without cloud round trips.
The underappreciated angle for this community: the biggest constraint for edge AI isn't model quality anymore — it's the orchestration layer. How do you push model updates OTA to a fleet of thousands of devices running different hardware? How do you handle inference when connectivity is intermittent? The model is the easy part; the distributed systems around it (connectivity management, fleet OTA, telemetry collection) are where most deployments actually struggle.
For anyone building local-first AI systems that need to work in the field, not just on a desktop — the connectivity and fleet management stack is where to invest your time.
lightningsiax@reddit
I'm currently working in this field, specifically i work in the device software and management while theres a team responsible for our inference creation, do you mind me asking what purposes you've seen edge AI on devices such as an orin nano used for?
ufos1111@reddit
TranslateGemma for translations, pretty neat! :)
Traditional-Gap-3313@reddit
Gemma 4 26B for my agentic legal search. Running on vllm on 2x3090 and it's quite fast in both prompt processing and decoding. Can fit 65k context which is enough for most searches, since I do several search agents in parallel and combine the results.
31B won't work in vllm due to broken KV Cache allocation. It will work once that if fixed. Until then, max I could get is 9k context on 2x3090s. (https://github.com/vllm-project/vllm/issues/39133)
Tested Qwen 3.5 35B moe, not that good in the legal text understanding benchmark, even though it works agentically. Qwen 27B has the same problem with kv cache allocation as Gemma 4 31B.
However, tested Gemma 4 31B over openrouter and it's significantly stronger in tool calling then 26B. My tool calls have additional optional parameters besides the query. 26B NEVER EVEN ONCE called it with optional parameters. A few tests I did with openrouter 31B showed that that model understands the tools a lot better. So I'm waiting for the vLLM to fix the bug with kvcache allocation and I'll migrate everything to 31B or Qwen 27B
NetFantastic4245@reddit
What are you searching in, do you have a RAG thing?
Traditional-Gap-3313@reddit
yes, complete vector+bm25 store with specific retrieval optimizations. I also exposed specific search functions over mcp so that agents can directly call them instead of using a simple query->retrieve->generate pipeline. Tested it primarily thorugh Claude Code's native MCP calling, but then I also tested smaller models as agentic drivers. Gemma is quite good.
Borkato@reddit
Gemma 4 36B is god tier at teaching and explaining. It’s even better than qwen.
It’s also relatively uncensored which is super fucking surprising.
cleverusernametry@reddit
36B?
Borkato@reddit
Oh uhhh whatever the big MoE is lol I always forget the params. The 36 A4B or whatever
Top-Rub-4670@reddit
Scary to think that you're in the accounting/legal/medical profession.
Borkato@reddit
I got it confused for qwen because I aliased them to gemfast, gemsmart, qwenfast, qwensmart, etc. sue me
cmdwedge75@reddit
Gemma4 26B A4B
Borkato@reddit
Yea that thing lmao
llama-impersonator@reddit
dots.mocr is really good at reading non-latin text, even arabic, hebrew and cyrillic. very easy to make a google lens style workflow with it and a gemma model.
TylerDurdenFan@reddit
CPU-only inference:
Bonsai-8B by PrismML. It's a 1 bit refinement/quant of Qwen3-8B.
Being 1.15GB not only means it needs less RAM, but for CPU inference it needs way less memory bandwidth. 20 t/s on a 4 core VM on EPYC. The 1.7B version ran 2 t/s on an old Celeron N3160.
Use PR#10 of their llama fork for CPU inference.
stopbanni@reddit
Gemma 3 4B is still I think best model in S size which can be good at russian, even knows animal sounds (for some reason, qwen3.5 9b thinks that cow says meow, lol)
Tyrannas@reddit
Churro OCR quantized Q4_K_M for historical documents OCR https://huggingface.co/mradermacher/churro-3B-GGUF
noddy432@reddit
Would you mind sharing your workflow app for OCR? I'm looking for a good OCR for historical handwritten documents. Thanks.
Tyrannas@reddit
Sure so I downloaded the gguf model and served it locally with llama.cpp llama-server, then I use this basic snippet:
```
I use the stanford-oval profile to get structured XML output but you can also have raw text by changing the profile.
MuDotGen@reddit
How does this do for modern official documents? Languages? Such as Japanese? I'd been using PaddleOCR for that use-case so far, but a 3B Q4 model seems tempting.
Tyrannas@reddit
Haven't tried on modern, but I'm pretty sure you can find better since Churro is a Qwen2.5 fined tunes on 100k historical documents. Maybe look on https://huggingface.co/collections/ggml-org/ocr-models to find other options ?
CodeCatto@reddit
What are the best coding models to run on a 12GB RTX 5070Ti?
tthompson5@reddit
I have a RTX 4070 Ti (also 12GB). I'm NOT a coder, but people on here say Gemma-4-26b codes well, and I got the UD-IQ4_XS quant from unsloth to run on my machine (https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF) at about 40t/s on startup (slower for longer contexts). It's probably worth a try for you.
I also successfully got it to write a couple of bash scripts for me (not big coding projects) and refactor a simple R script. From my experience with it, it seems reasonably competent. I use llama.cpp to serve the model.
Tuning it for speed versus hogging all the system RAM is still ongoing, but if you want it, I can share my full current start-up script for it. A lot of experimentation and trail-and-error has gone into it, and I'm still experimenting with tuning it. I'm not using the mmproj file to enable vision by default. Unless you need the vision for your use case, it's better to leave it off and save yourself the VRAM/RAM.
Anyway, these are the important flags I'm currently using, and I hope they'll give you a good starting point if you decide to try the model.
--ctx-size 100000
--parallel 1
--cache-type-k "q8_0"
--cache-type-v "q8_0"
--flash-attn on
--swa-checkpoints 2
--cache-ram 2048
--checkpoint-every-n-tokens 16384
--defrag-thold 0.1
--temp 1.0
--top-k 64
--top-p 0.95
--min-p 0.02
--repeat-penalty 1.1
--no-mmap
--cpu-moe 12
--jinja
--chat-template-kwargs '{"enable_thinking":true}'
Screenshot from running it on my machine with the car wash test question:
mr_357@reddit
I tried running gemma4-26b-Q4-K-M on my 5070 and while it's amazing for creative tasks and decently fast, it really struggles for coding if provided tools (like running it through copilot, cline or other vscode extensions). I had a lot more luck with mistral's devstral, but obviously it's much much slower because it doesn't fit in the VRAM
tthompson5@reddit
Fwiw, I did write my comment above before Qwen3.6 came out (either version), which is also worth trying.
I'm still having decent luck with Gemma-4 writing me scripts and such (including debugging them). Last night I had it write me a script to send a bunch of battlemap images (one at a time) to llama-server and get back descriptions and tags and save them as properly titled json files. All that said though, I have heard more from some redditors that say Gemma-4 struggles with a true codebase, which may be a result of how its attention window works.
I don't know if you care to tinker more with Gemma-4, but if you do, the jinja chat template is still broken for tool calls even though Google updated it again just a few days ago. You can try the fix detailed here: https://www.reddit.com/r/LocalLLaMA/comments/1syps6i/i_stumbled_on_a_gemma_4_chat_template_bug_for/ After I started using that template, the number of failed tool calls from Gemma dropped noticeably.
mr_357@reddit
yeah I just tried out qwen 3.6-35b-a3b and it's pretty good, it doesn't always make correct code on the first try, but with some guidance it can get stuff done much faster than me typing it
I'll try out the fixed chat template for gemma, but other than the tool calls it also sometimes starts looping in on itself and doesn't seem to have good training data for stuff I want to use it for (game dev)
Sergei-_@reddit
hi, are is this app are you using to chat?
tthompson5@reddit
You mean where did I get the screenshot? It's part of llama-server. I can talk to the AI by opening my web browser and going to localhost port 8080 (or whatever you set when you launch the server).
Right now, I'm mostly using Mistral's Vibe to actually work with the AI, but there are better harnesses depending on what you want to do.
PairOfRussels@reddit
I've been running llama.cpp with qwen 3.5 (now 3.6) 35B A3B model. I started with a context size that I need (70K context size for example) put all the layers on GPU, then put as many MOE experts on CPU/DRAM until I have all the model and context fitting in the 10GB VRAM (and none in the 24GB shared VRAM.. because as soon as I share between VRAM and Shared VRAM aka DRAM it slows to PCIE transfer speed).
This gets me about 100t/s prompt eval and 30t/s token generation.
Is there a better model and start params to use for a 3080 RTX to do agentic coding with Cline?
quickreactor@reddit
AreaExact7824@reddit
Best agent for subagent
Party-Log-1084@reddit
Building a completely local, uncensored RAG setup for sysadmin tasks (ingesting logs, docker-compose, PDFs as strict source of truth).
Specs: Linux Mint 22 | Ryzen 9 5950X | 64GB RAM | RX 7800 XT (16GB VRAM)
Need advice on optimizing this AMD stack:
Any insights appreciated!
oldtekk@reddit
What llm for roleplay for simulating hand to hand combat? so needs goo spatial awareness, etc
Aaronski1974@reddit
Minimax 2.7 unsloth 2bit u m or something. Amazing. Best local model I’ve ever used by far. Getting 40tps and about 15s to process 40k token prompt. Instant once it’s cached, and maybe .5s to first token on an empty cache on a dgx spark. It’s replaced haiku for me. Replaced sonnet too for non-coding. It gets stuff.
zanar97862@reddit
What hardware are you running for those speeds? Even small quants of 2.7 are still huge for local so wondering what it takes to get reasonable performance.
Qwen30bEnjoyer@reddit
Not OP, but I want to chime in that I got ~150 TPS PP and ~5 TPS TG on 96GB Ram 16GB VRAM inference using 2 x 48gb 4800MT/s DDR5 and a 6800xt.
Aaronski1974@reddit
Surprisingly capable isn’t it? One thing I’m not sure if it’s the low quant or just the model, but I have one server named razer, it’s a razer laptop. And for the life of minimax it keeps correcting the name to razor. It’s common failure mode when your think is similar to a very famous thing the model just goes with the similar thing. With 5 tps, one solution is to go completely asynchronous. For example, I have a single cron that fires hourly and pokes opencode to look at open GitHub issues and just work on them and document. In 3 days it’s opened a bunch of issues and closed them, 58 so far. If I have an idea I. Claude code and I don’t want to get Sadie tracked I just have it pop in a gh issue for later, then open code comes along and does the issue.
Icy_Lack4585@reddit
GB10. the MSI branded DGX spark. with a little upgraded cooling.
Perfect-Flounder7856@reddit
That's an interesting model quant. Still holding up?
Aaronski1974@reddit
I got a second dgx spark and moved to 3bit awq in vllm. It is a solid coder in opencode. I built a Claude code mcp server for it, we chat, it reads a ton of context- then farms hard stuff to Claude. Works great. I got like 4 mil tokens of context to play with and a 95% cache hit rate, an combined 90 ish tps at 32 simultaneous threads.
vex_humanssucks@reddit
Thanks for putting this together. One thing I'd add: context window size is increasingly becoming a differentiator for local models. Running a 26B+ model with 128K context usably requires significant VRAM, so the sweet spot for most home setups is still 7-14B with careful quantization. Would be interesting to see a VRAM budget breakdown alongside the benchmarks.
R_Duncan@reddit
That's for dense models. Qwen3.6 35B-A3B can run with over 128k context in 8Gb VRAM
2funny2furious@reddit
Have old hardware. A 1080 with 32gb ram. Using llama.cpp, I can get about 18 t/s out of qwen 3.6 35b in q5. Q4 goes to about 22. That is using 32k context. What settings/speeds are people getting on old hardware out of it?
CrowdThumper@reddit
I am really enjoying using Qwen 235B A22B thinking and it works beautiful.
PsychologicalMail198@reddit
What is your system specs?
I have a Nvidia RTX 4070 Ti Super, AMD Ryzen 7 7800x3D and 32 gigs of Ram but the response time is very inconsistent some times it's under 30s and most of the other times it's takes a very long time or goes on forever.
I'm not able to understand if my system is reaching is limit or something else is the issue because the CPU & GPU use is always below 60%
overcompensk8@reddit
Out of VRAM - check nvidia-smi
Constant-Bonus-7168@reddit
Appreciate this — the community benefits from thoughtful posts like this.
Practical-Charge8321@reddit
I was pretty happy with qwen 3 8B from a quality perspective, but it's pretty slow on my tiny 8GB VRAM rig
vex_humanssucks@reddit
Good list. One I'd throw in the "solid mid-tier" bucket that doesn't get enough mentions: running Qwen3.6-27B at Q4 on 24GB VRAM is genuinely usable for agentic tasks where you need speed over max quality. The quantized version holds up surprisingly well on structured output generation compared to the full weight.
Artanisx@reddit
Need recommendations for Coding :) M/L = 24 gb VRAM and/or 64gb RAM
I would like something that perform similarly to Perplexity Pro on coding tasks.
ChatEngineer@reddit
For agentic coding workflows specifically, Qwen3.5-35B-A3B is a solid pick (as mentioned above) — but don't overlook the importance of
parallel_tool_callssupport. That single feature dramatically changes throughput on multi-tool workflows. If you're using llama.cpp, make sure you're on a recent build since function calling / tool use support has improved significantly in the last few months. Also worth noting: for local agent stacks, the model's instruction-following on structured tool call JSON matters more than raw benchmark scores. A smaller model that reliably produces valid tool calls beats a larger one that occasionally hallucinates tool schemas.agenticaipaglu@reddit
What is the best model for data dictionary mapping?
PayNo6483@reddit
“Best” lists change monthly. i've found it’s more useful to pick models that match your hardware and tasks
rileyphone@reddit
Is anyone still using base models? For open-ended text generation (like with looms or mikupad). Now that Hyperbolic 405b base is down the only API option is text-davinci-002. I'm back to using Llama 3.1 8b local but there has to be something better that isn't annealed to death.
deltan0v0@reddit
featherless has some base models on api. nothing huge but might help you if you're bottlenecked on ram
i've been liking kimi linear base, hosted locally. it is hard to find ones that aren't filled with synthetic data, tho, yeah. mistral small 24b base is also quite good
trinity nano's fast but ime less good than llama 3.1 8b / mistral 7b
everything better is quite large, how much ram do you have?
(also, loom mentioned, which do you use? i've been using tapestry-loom)
rileyphone@reddit
Trying Kimi Linear Base now. Mostly working on my own loom projects, right now one in a phone form factor.
Ok-Internal9317@reddit
hmm you seems to be outdated, i havent heard of llama3.1 since a long time
HopePupal@reddit
i'm not using them, but there are base models available for Gemma 4, Nemotron 3, the smaller versions of Qwen 3.5 (including 35B-A3B but not 27B)
pmttyji@reddit
https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K - u/FizzarolliAI
officialAdfs_m0vie@reddit
General usage, Medium (I have 16gigs of vram)
Joozio@reddit
For coding specifically the local vs frontier gap narrowed a lot. Ran Aider with a few local models alongside Claude Code and Codex CLI - the harness configuration made more difference than model choice at the margin. Not sure this generalizes but for structured coding tasks local at 70B+ is surprisingly close.
Skid_gates_99@reddit
Qwen3.5-27B on a single 3090 for most of my agentic work. bartowski Q6_K quant, 64k context, thinking off for tool calls because it wastes tokens reasoning about which function to invoke when the schema already tells it everything it needs to know. Gets me around 20 t/s on generation which is fine for agent loops where the bottleneck is the tool execution anyway.
Tried Gemma 4 26B for a week and went back. Quality is genuinely good when it works but the crashes and the tool call formatting issues killed my trust. I need something I can leave running overnight on a multi step workflow without babysitting it. Qwen has been boring and reliable for that which is exactly what I want.
Have not tried GLM 5.1 yet but the benchmark post from earlier today has me curious. If anyone is running it locally for agentic stuff I would love to hear how the tool calling holds up.
_derpiii_@reddit
pretty sure this account is a bot.
TheLastSpark@reddit
How are you fitting a q6 27B and 64k context? All of it can't fit in vram - right?
apollo_mg@reddit
Probably Turboquant or a variant for KV cache.
TheLastSpark@reddit
But even q4 there's no way unless I'm missing something
Cooproxx@reddit
what kind of agentic work do you do? Curious what’s possible on the 3090
sk1kn1ght@reddit
Some new research has come out examining the new mythos from anthropic. There is a strong consensus based on available data that the model itself is not THAT much better(it's still better but not that much). What instead allowed these improvements is it's loop Iteration and some people have replicated quite amazing results with 27b and 31b with loop Iterations
Fresh-Resolution182@reddit
Qwen3.6-35B-A3B for coding agent driver, the speed change is real — something about watching tokens fly at 120 t/s makes you actually run tasks instead of batching them. Gemma4 e4b for quick questions when you don't want to spin up the full reasoning chain
Ok-Internal9317@reddit
what is the best model under 1B?
Same_Platypus1629@reddit
Best for article writing? 5070ti 16gb vram, 32gb ram
Ok-Internal9317@reddit
qwen 3.5 9b
nerdylicious05@reddit
I would love to hear what people are using with Home Assistant. Tried llama3.1:b with mixed results, but I am new to local llms
MarkoMarjamaa@reddit
There are some smaller models fine-tuned for Home Assistant in Huggingface ( just search for Home Assistant).
I'm using gpt-oss-120b because I'm using HA in Finnish.
Ok-Internal9317@reddit
its also the cheapest for inferencing (on openrouter), yeah its pretty good stuff
thavidu@reddit
Thats a super old model at this point, I haven't used HA in a long time but you should try a more recent release small model, like maybe one of the smaller param Gemma4 models or smaller param Qwen3.5 models
CriticalCup6207@reddit
For finance/reasoning workloads specifically: Qwen3.6-35B-A3B is our daily driver for structured data extraction and code gen. For long-context document analysis (10-K filings, earnings transcripts) we're still on Gemma-4 27B — handles 128k context more gracefully in our testing. The 4-bit quant quality has gotten good enough that I've stopped caring much about the precision tradeoff for most tasks.
ActiveCheap4371@reddit
qwen3.6 35b a3b
melspec_synth_42@reddit
been running qwen3.6 locally, the reasoning quality jump over 3.5 is noticeable. anyone on 24gb vram hitting issues?
Big_River_@reddit
wonderful post - tip of infrastructure to ya
DeepBlue96@reddit
for anything "light" qwen3.5 9b, for coding qwen3.5 35b a3b, also we are talking local not impossible to host except with a 6000$ setup pls
Ha_Deal_5079@reddit
glm-5.1 for agentic coding hands down. managing skill configs across agents is annoying as hell and https://github.com/skillsgate/skillsgate has been helping with that
JournalistLucky5124@reddit
Need recommendations. S = 4gb vram and/or 16gb RAM 🙃🙂
WhoRoger@reddit
Smollm3 3B, LFM 2.5 1.2B, Granite 7B (MoE, offload to RAM), Ministral 3B
ben_g0@reddit
I'm running Gemma 4 E4B on my phone with 16GB RAM, and it's quite responsive (even when running on the CPU) and surprisingly capable for a small model. I'm not sure it's going to work well on your GPU but it'll likely still be performant enough running on the CPU.
DoorFit7827@reddit
I’ve been testing the Gemma 4 / Qwen hybrid-attention costs as well. The thermal throttling is the real bottleneck. I actually managed to stabilize the flow using a deterministic routing logic (Dirichlet-Shift) that cuts redundant cycles. Verified a 16.8% energy recovery via ZKP, which keeps the clocks higher for longer. I’ve put the skeleton on GitHub if you want to see how to bypass the standard JAX-level friction: https://github.com/BerzeShift/Berze-Shift
Essentially, there is no reason AI should waste energy when 100% of the time in 1 million simulations that energy is useless. Like a human counting floor boards in every room it walks into or putting on a life jacket when they enter their 20th floor office to be safer.
Accomplished-Pea3574@reddit
coding and to learn how to code, I have a MacBook m4, 16gb
h-mo@reddit
For agentic/tool use workloads (the main thing I care about professionally), I've been routing between two models depending on task complexity - a smaller fast model for routing and triage, a larger one for actual reasoning. Running this through Open WebUI with a custom pipe that scores complexity before deciding which model gets the call. For the larger slot, Minimax-M2.7 has been surprisingly capable - genuinely Sonnet-level on structured output and multi-step tool use, which matters more to me than benchmark scores. For the smaller slot, anything in the Qwen3.5 family at Q4 handles classification and short-context tasks cleanly without burning tokens. The key thing I've learned is that for production pipelines, instruction-following consistency matters far more than raw intelligence - a model that follows system prompt constraints 98% of the time beats a smarter model that goes off-script.
No-Judgment9726@reddit
Been running Gemma4 on M4 Pro for about a week now, Q4 quant. Honestly more impressed by the instruction following than the raw benchmark scores — it just gets what I mean more often than Qwen3.5 at similar sizes.
GLM-5.1 is interesting too but haven't had time to properly test it yet. Anyone done a side-by-side with Gemma4 on coding tasks specifically?
david_0_0@reddit
the vram tier breakdown is really helpful. wondering how much the context window length varies across these tiers though. feels like when youre comparing the S tier models, a 4k vs 8k context window can massively change what use cases actually work. do the ones in that range handle long-context stuff well or does performance degrade pretty fast?
No-Judgment9726@reddit
GLM-5.1 has been surprisingly strong for its size class. What's interesting is that it's not just benchmark numbers — the actual instruction following quality feels noticeably better than what the evals would suggest.
For anyone running on Apple Silicon, the Gemma4 series has been my go-to recently. The memory efficiency at Q4 is impressive, and MLX support is solid out of the box. Would love to see more on-device agent-oriented models make this list in the future — the gap between cloud and local is shrinking fast.
david_0_0@reddit
the breakdown by vram tier is really useful. curious about something though - when youre evaluating models in the M size range (8-32GB), how much weight do you give to context window vs raw inference speed? feels like that trade-off massively changes depending on the actual task but i dont see much discussion of it
Human-spt2349@reddit
Gemma 4 is good
LordStinkleberg@reddit
Best model & quant for agentic coding on a fresh Mac M5 128GB?
brandybuckferryman@reddit
What are the best coding models (preferably run with CC) to run on a 24GB AMD GPU 7900 XTX and 64 GB system memory?
Internal-Month-4812@reddit
Interesting to compare outputs across models
on identical inputs.
LLaMA 3.3 70B via Groq consistently catches
different issues than GPT or Claude on the
same codebase — each model has systematic
blind spots the others don't share.
Has anyone done structured comparisons of
per-model accuracy on specific task types?
FlightCautious3748@reddit
minimax m2.7 has been the most useful for client work lately, team was skeptical but the throughput on longer context tasks is actually solid for the cost of running it locally
__Captain_Autismo__@reddit
Startup founder - 96gb vram ( rtx 6000 pro )
Coding, writing and web dev.
General purpose: Minimax 2.5 reap q4
Web dev: Gemma 4 31b it bf-16
Full tool use through both on my built from scratch agent harness. Manage workflows through my control surface.
Around 80-90% or more of my ai usage is now local and the workflows get better daily.
I don't use small models for anything unless it's something like embeddings. Always throw as much compute as I can at it.
MajinAnix@reddit
Wrong question, we need best fastest possible model, sweet spot
Fuzzy-Layer9967@reddit
Any focus on difference between AMD / arm architectures ?
Spirited_Maybe7374@reddit
what's the best model for text summarization? I have an M1 Pro Max with 32GB
pmttyji@reddit
Any recent 4B models are enough for this. Try Qwen3.5-4B for example. If you want more better, Qwen3.5-9B there.
Ki1o@reddit
What's best for coding in a RTX6000 maxq ? I'm running qwen3.5-27b unsloth currently and whilst it works well it large contexts .. I am curious if I should be experimenting with ithe models for this 96 GB VRAM card
sagiroth@reddit
Single 3090 and 32gb RAM still Qwen27B Bartowski/Qwopus best for coding/ agent tool calling with opencode?
Novel_Law4469@reddit
So far i've tried
gemma4:26b-a4b-it-q4_k_m (approx 20toks)
qwen3:30b-a3b (approx 10-11 toks)
on a 8GB RTX4060, with 48GB RAM machine
basic prompts - no probs.
mid-complex prompts - no too bad either, was able to handle stuff like 'Design a PostgreSQL database schema for a multi-tenant SaaS application' and RLS stuff pretty okayish too.
tidel@reddit
Anyone got any success using vllm? I’m successfully running qwen3.5-35B on llama.cpp and wanted to try vllm just to have a reference, and I can get it to run, tools calls and thinking are painful to get right. And! I’m definitely looking in the wrong place on how to get this right ….
truedima@reddit
Which vllm version were you trying? I havent used 35b a but I use 27b a bunch on vllm. Both as cyankiwi awq quants (int4) and both work fine, with tools and all. Thats on 0.17, 0.16 didn't have support yet and iirc somewhere between 0.17 and 0.19 there might have been a few regressions that crept in (see changelog).
tidel@reddit
I just found this post: https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/ from 5 hours ago and directly testing :)
jinnyjuice@reddit
Please break down more categories for >128 GB. You don't have to label them with 'S' 'M' etc. Just use the number ranges.
AI_Conductor@reddit
For agentic and tool use on constrained hardware, the GGUF quantization path has been the unlock for us. We run a ComfyUI image generation sidecar on a 4GB VRAM Turing GPU (T1000) and had to make hard choices about model formats.
The key insight: on Turing hardware (compute capability 7.5), there is no fp8 execution silicon. Hopper added that. So fp8 models that claim to be lightweight actually run as software-emulated fp16, which is slower than just using a properly quantized GGUF. We run FLUX Schnell via a Q4_K_S GGUF (6.7 GB, fp16 compute path) instead of the fp8 all-in-one checkpoint (17 GB, emulated fp8). The GGUF is both smaller and faster on this hardware.
For the agentic orchestration layer specifically, we use Claude Sonnet via API for the reasoning and tool-calling backbone. But the image generation, TTS, and other media sidecar tasks all run through local models in Docker containers. The architecture is a thin MCP server that routes tool calls to the appropriate backend, whether that is a cloud API or a local container.
Hardware: NVIDIA T1000 (4GB VRAM), 32GB system RAM, WSL2 with 31.2GB envelope. The ComfyUI container gets a 24GB cgroup ceiling to handle FLUX peak memory via CPU offload. Everything else stays under 7GB combined.
Size category: S (under 8GB VRAM). Making it work at this tier requires being very deliberate about quantization format and memory management.
Farmadupe@reddit
I've got a 3090 + 2060 at home, that's enough for models in the 30B range at \~q5 with llama.cpp
* For general agentic work, qwen3.5-27b at \~Q5 is just about on the right side of competent, with some handholding on MCPs that its given. But with a small set of tools, and the right carefully crated prompts, it can do useful stuff independently.
* For batched logging/classification, I switch to the smallest qwen3.5 possible that I can use. qwen3.5-9b and below fit entirely in the 3090 at fp8 so can run under vllm, which is way way faster and less buggy.
* for some tasks, I can switch to 122b or 397b, but they're orders of magntiude slower so don't get used much.
* qwen3.5 has SOTA-level image comprehension. there's no need to pay money for image classification tasks.
* gemma4 31b is roughly comparable to qwen3.5-27b but not quite as good. The only task that it really beats qwen3.5 is video comprehension. I can stuff 100k tokens into context and get better groundings than qwen3.5 series. The default persona is a bit more pliable than qwen3.5, which can be a bit robotic.
* honestly, at 32Gb vram, I don't think I can replace opus/codex for agentic coding yet. I make my paycheck of coding and qwen3.5-27b is too slow and not brainy enough for coding tasks.