[-]

kelembu@reddit

Uncensored answers, Ryzen 5800x, 32GB RAM, 16GB VRAM AMD 6800 Gpu

[-]

rm-rf-rm@reddit (OP)

Agentic/Agentic Coding/Tool Use/Coding

[-]

I am using unsloth/Qwen3.5-35B-A3B-Q5_K_XL and getting excellent results. I am using it over 27b for memory management and speed because I am testing a config that works without any cloud services and seeing how much quality I can get if I load everything at once.

I have ASR, TTS, text2Image, image2image, LLM with vision and embeddings simultaneously.

System: 96 GB RAM, 56 GB VRAM total (RTX 5090 + RTX 4090)

unsloth/Qwen3.5-35B-A3B-Q5_K_XL

mmproj-F16.gguf

llama.cpp config:

ctx-size=262144

threads=16

parallel=5

cache-ram=8192

n-gpu-layers=999

kv-unified=1

jinja=1

cont-batching=1

Using unsloth guide rec's for inference settings

temperature=0.7

top_p=0.8

top_k=20

min_p=0.0

presence_penalty=1.5

repetition_penalty=1.0

thinking toggle via chat_template_kwargs.enable_thinking (off in most but not all agents)

parallel_tool_calls=true <-- VERY IMPORTANT FOR OUR USE CASES

Image stack models/config:

diffusion: flux-2-klein-4b-Q4_K_S.gguf

VAE: full_encoder_small_decoder.safetensors

text model: Qwen3-4B-Q4_K_M.gguf

defaults: steps=4, cfg_scale=1.0, strength=0.75

Other local models in same runtime:

Embeddings: microsoft/harrier-oss-v1-0.6b

ASR: Qwen/Qwen3-ASR-0.6B

TTS: microsoft/VibeVoice-1.5B + Qwen/Qwen2.5-1.5B tokenizer

[-]

Substantial-Flow9244@reddit

Do you mind if I ask what your average power draw looks like?

[-]

nacnud_uk@reddit

May I ask, is it okay to mix AMD and NVIDIA in the one system. I've a 3080 and a slightly older AMD card. The 3080 is 10gb. I think the other is 8gb.

[-]

awitod@reddit

Not at the same time for the same model runner. You could theoretically use both cards in the same machine at once but you would have to give them distinct workloads, e.g. one for LLM, one for image gen.

...but it sounds like a fun bad idea to me. :D

[-]

Sorry, I was trying to edit and get the links to the libs.
qwen-asr · PyPI

vibevoice · PyPI

[-]

Far-Low-4705@reddit

qwen3 ASR was just added to llama.cpp!

[-]

RaptorF22@reddit

How do the 2 rtx cards combine their vram? I thought that was only possible with 3090s.

[-]

awitod@reddit

It’s well supported by the nvidia drivers and mixing devices is pretty common.

Many things allow you to configure a specific GPU, all GPUs, auto or cpu only and I have been tweaking that for each thing as I go to get the most out of what I have.

It’s kind of like packing a car 😆

[-]

b0tm0de@reddit

hello. full encoder small decoder.safetensersor. i downloaded that from official repo. put it in vae folder for comfyui and its just giving error. do you know how that working for u? comfyui v18.5

[-]

awitod@reddit

I’m not using comfy, just stablediffuision.cpp which we build with the image.

[-]

puru991@reddit

What t/s are you getting?

[-]

awitod@reddit

Here is some normal output.

[32881] slot print_timing: id  2 | task 1170 | 
[32881] prompt eval time =     346.69 ms /   891 tokens (    0.39 ms per token,  2570.00 tokens per second)
[32881]        eval time =    3493.48 ms /    97 tokens (   36.02 ms per token,    27.77 tokens per second)
[32881]       total time =    3840.17 ms /   988 tokens
[32881] slot      release: id  2 | task 1170 | stop processing: n_tokens = 6454, truncated = 0
[32881] srv  update_slots: all slots are idle
[32881] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  proxy_reques: proxying request to model Qwen3.5-35B-A3B-Q5_K_XL on port 32881
[32881] srv  params_from_: Chat format: peg-native
[32881] slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
[32881] slot launch_slot_: id  1 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist 
[32881] slot launch_slot_: id  1 | task 1269 | processing task, is_child = 0
[32881] slot update_slots: id  1 | task 1269 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 825
[32881] slot update_slots: id  1 | task 1269 | n_tokens = 0, memory_seq_rm [0, end)
[32881] slot update_slots: id  1 | task 1269 | prompt processing progress, n_tokens = 313, batch.n_tokens = 313, progress = 0.379394
[32881] slot update_slots: id  1 | task 1269 | n_tokens = 313, memory_seq_rm [313, end)
[32881] slot init_sampler: id  1 | task 1269 | init sampler, took 0.14 ms, tokens: text = 825, total = 825
[32881] slot update_slots: id  1 | task 1269 | prompt processing done, n_tokens = 825, batch.n_tokens = 512
[32881] slot update_slots: id  1 | task 1269 | created context checkpoint 1 of 32 (pos_min = 312, pos_max = 312, n_tokens = 313, size = 62.813 MiB)
[32881] slot print_timing: id  1 | task 1269 | 
[32881] prompt eval time =     276.73 ms /   825 tokens (    0.34 ms per token,  2981.27 tokens per second)
[32881]        eval time =     336.05 ms /    42 tokens (    8.00 ms per token,   124.98 tokens per second)
[32881]       total time =     612.77 ms /   867 tokens
[32881] slot      release: id  1 | task 1269 | stop processing: n_tokens = 866, truncated = 0
[32881] srv  update_slots: all slots are idle
[32881] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

[-]

Total_Activity_7550@reddit

Qwen3.5-27B. Nothing else that fits into 2xRTX 3090 works for my project. I use Qwen Code.

I also have my personal written todo webapp, it has MCP server. Gemma 31B is on par with Qwen3.5-27B.

[-]

Far-Low-4705@reddit

since u have two 3090's, u should try the new `-sm tensor` flag, it enables tensor parallelism.

it is still very much experimental, but you should keep an eye on it!

It will almost certianly make a difference for you since you are pretty much the target hardware

Normal: uses the Q4_K_XL quant and 262k context, runs at <23 t/s, surprisingly doesn't slow down too much at high context
Excessive: uses the IQ4_NL_XL quant and up to 524k context, runs at 13 t/s, slows down to half by around 150k context

Specs: AMD Ryzen 7 3700X, 32GB DDR4-RAM 2666 MT/s, 8GB Geforce RTX 3070 Ti, +64GB of paged pool on my fastest ssd (helps prevent total system lockup)

I ran multiple tasks that passed 200k context length and LLM was still coherent and functional! (can add features, fix bugs, etc)

[-]

mrtime777@reddit

cyankiwi/MiniMax-M2.7-AWQ-4bit (or cyankiwi/MiniMax-M2.5-AWQ-4bit) on 2xGB10 cluster..

[-]

Waarheid@reddit

Both have a common problem with understanding what exactly needs to be passed to the parameters of some tools.

Have you tried skills instead of tools? (i.e. just giving them a bash tool, and replacing all your tools with CLI commands that are each described in a markdown file)

https://mariozechner.at/posts/2025-11-02-what-if-you-dont-need-mcp/

I've found pretty small models are and to use skills effectively this way, particularly via the pi.dev agent

[-]

eikenberry@reddit

CLI tools + markdown describing their usage is pretty much an ad-hoc version of MCP.. why write custom cli commands + docs describing their use instead of writing your own MCP servers?

[-]

Waarheid@reddit

Why not do the more complicated and heavier implementation instead of simple and lightweight implementation? Great question...

[-]

eikenberry@reddit

Writing a new CLI command doesn't seem much easier than writing a new MCP server. I don't see how the CLI+docs is lighter weight than a MCP server? I guess it is the formality level... A small shell script command + quick docs would definitely be lighter weight where the MCP server requires more up front. Seems like an MCP server that simply provided a way to add shell-scripts + a quick doc to them would hit a similar level but then I guess that is pretty much what agent/harnesses are for and thus they already provide this.

TLDR; I think I've circled back around to agree with you, at least in the short run. IMO the verdict is out long term as I do think the MCP server idea has some merit.

[-]

Waarheid@reddit

MCPs became popular before harnesses that lived in the shell got big. Having a tool to run shell commands in those harnesses means just writing a shell command and an MD file is the simplest way to go, of course assuming your harness makes use of that (e.g. how pi does)

[-]

_derpiii_@reddit

I use this model with a Telegram bot as a background memory manager

Memory manager?

[-]

Zc5Gwu@reddit

+1 for minimax 2.7. I'm running with the following command on strix 128gb. Works well for agentic coding if your patient (do some laundry in the meantime). At 30k context getting about 16t/s and 50t/s prefill.

llama-server -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_XXS --temp 1.0 --top-k 40 --top-p 0.95 -c 64000 --jinja -fa on -ngl 99 --no-context-shift -fit off --no-mmap -np 1

[-]

sn2006gy@reddit

works well as agentic? it's TERRIBLE lol.

BUT.. if you don't mind wasting time/electricity go for it. DON'T USE IT WITH AN API IF PEOPLE ARE READING THIS. YOUR COSTS WILL GO TO THE MOON

[-]

Local-Cartoonist3723@reddit

Didn’t get a chance to try this yet — you’re happy with it then? Any writeups?

[-]

Blues520@reddit

I'm using qwen3-coder-next on 48gb vram. Using an unsloth quant with 100k context.

Running in llamacpp with opencode.

-c 100000

--flash-attn on

--n-gpu-layers 999

--n-cpu-moe 24

--jinja

--temp 1.0

--top-p 0.95

--min-p 0.01

--top-k 40

[-]

Safe-Buffalo-4408@reddit

I've used 122B, but not minimax. But 27B performs better than 122B for be, it gets very obvious in Agent Zero between the two. 122B creates broken code and strange never ending loops, 27B is a work horse that slowly but firmly creates fully working software and seldom failing tool calls ans as good as never get in to loops longer than 2-3 iterations before it breaks out of them and continue it's work.

I've been pretty pleasantly surprised with Qwen3.5 122b (heretic mxfp4). I used next before, but I found it wasn't as consistent as I would have liked. I'm finding the later model pretty consistently sorts out complex coding tasks.

But maybe that's just my workflows. Primarily python recently.

[-]

orzechod@reddit

model: unsloth/Qwen3.6-35B-A3B-Q6_K

framework: llama.cpp + llama-server Vulkan

llama.cpp config:

n-gpu-layers: 99
cache-type-k: q8_0
cache-type-v: q8_0
ctx-size: 131072

inference settings: defaults

hardware: Ryzen HX 370 with 96GB of RAM (minisforum x1 pro)

coding agent: late

notes:

I don't have the memory bandwidth for a dense model like Qwen3.6-27B but 35B-A3B has been working out really well for me. I get a touch of overthinking but the model seems to work itself out eventually, and my inference speed (~22 tok/s using Vulkan) is fast enough that I don't really mind. using an orchestrator like Late seemed to be the missing piece for my stack though. prior to it I was using goose as my agent orchestrator and was having a ton of issues with tool calls silently failing. however Late is separating the architect model from worker models is doing wonders for me; I am able to make good progress on a variety of Javascript/Typescript, Python, and Lisp projects with a minimum of fuss now.

[-]

baliord@reddit

I use several models for different things; I run almost exclusively llama.cpp, and I use llama-swap to sit in front of my llama-server instances, providing around 32 different model choices. (I test different models regularly.) My go-to has been MoE models since \~GLM-4.6, as I can split them between GPU and CPU, and they handle it much better than dense models.

Right now I'm using GLM-5.1 (unsloth/GLM-5.1-GGUF)at 3 bit quantization, generally in non-reasoning mode, for creative writing. It's also my go-to for anything where I want to talk, but not do tool calls. At 10 t/s generation local, it's just way too slow at that, but it's human-speed for conversation or character-driven stories. It also picks up on character-definitions better than any other model short of Opus. (I've also used GLM-5.1 in the cloud for OpenClaw historically, because it's Opus-level smart, and because I want my agent to adopt the persona that is defined for it. These days I'm trying to use Qwen-122B local more consistently for OpenClaw, unless I need the smarts.)

For agentic use, Qwen 3.5-122B works surprisingly well, although not much of a 'personality'. I've run it at q4 (fully in GPU RAM) at \~50 t/s generation. I haven't needed to push up to q8 for it, and if I need much smarter I go cloud. Now the specific model I'm using there is HauhauCS/Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive. I also use that for image analysis, tagging and processing using mmproj-f16. The q4 isn't as good at image analysis as the q8, though. If I had to pick a model to stick in pure GPU RAM semi-permanently, it'd probably be this one, although I'd bump up to Q8 and let some of it sit in RAM.

I have an embedding model, but I don't really use it that much anymore. I was using Qwen3-Embedding-8B-Q8_0.gguf and a smaller Qwen reranker. I need to get back to this.

My system is a ASUS ESC4000A-E12 with a 32 core EPYC and 384GB of DDR5 RAM, 2xL40S for 96GB of GPU RAM; it sits in a SysRacks rack in my garage.

My basic config for each llama.cpp llama-server call in the llama-swap config expands to:

llama-server --prio 2 --mmap --log-timestamps --kv-unified --fit on --metrics \
  --jinja --temp 1.0 --min-p 0.01 --top-p 0.95 --threads-http 8 --mclock \
  --host 0.0.0.0 --port ${PORT} \
  --flash-attn on -ctk q8_0 -ctv q8_0

For non-reasoning, I add:

--reasoning-budget 0 --chat-template-kwargs '{"enable_thinking": false}'

I customize the -c {context-length} per model, because if you don't manually set a context length, --fit on will shrink your context to nothing, in order to fit the models, before it goes to RAM. :rage:

I also have a 'limited reasoning' for use cases where I want it to do reasoning, but I want it to not waste all it's time doing it. So I'll limit it to \~2048 tokens of reasoning, and leaving enable_thinking alone. (E.g. using the above 50 t/s on Qwen3.5-122B@Q4, that's about 41 seconds of reasoning.)

Hope that helps!

[-]

Viperus@reddit

I tried spinning Qwen 3.5 122B with Q6 and 64k context on a Jetson Thor 128GB but this doesn't work well without some kind of orchestration, it seems.

When you say "For agentic use, Qwen 3.5-122B", how to do orchestrate agents?

[-]

baliord@reddit

Not sure what you mean by 'some kind of orchestration'? I'm running llama.cpp which offers an http endpoint, which I then configure in various other tools (Silly Tavern, and OpenClaw are the two main ones I use).

The thing that makes Qwen3.5-122B good for agent use is its tool-calling smarts. You do need an agent framework to use it, of course. I like OpenClaw, and use it extensively, but I've heard good things about Hermes, and others.

Really the question is what do you want to do with it? What need do you have that you'd like addressed?

[-]

Viperus@reddit

I ran a test just this weekend to convert the frontend from angularJS to blazor server, to do it one view at a time, test etc. but it failed horribly, mostly due to low context and too much compacting.

So, I'm trying to figure if I can split the work with subagents etc. Can it even be done with 64k context, or use Q8 or even Q6 context with 128k context, or should I use a smaller model etc.

[-]

Nindaleth@reddit

I customize the -c {context-length} per model, because if you don't manually set a context length, --fit on will shrink your context to nothing, in order to fit the models, before it goes to RAM. :rage:

Would this option be of any help to you?

-fitc, --fit-ctx N minimum ctx size that can be set by --fit option, default: 4096

[-]

baliord@reddit

Interesting! I hadn't seen that, for some reason. Yes, that will help a lot!

[-]

danigoncalves@reddit

That's a beast what you have there :D

[-]

baliord@reddit

It is! My wife calls my ML server my mid-life crisis, 'cause it cost about as much as a car, but at least I'm not out there trying to race it. And it sounds like the blower on a race car when doing training or inference. One person described it as 'an HVAC system with opinions'.

Sorry I don't have deep testing, but I tried 5-10 other models and there was always lots of back and forth with more changes, errors, mistakes, but with these models I don't feel that, so I just stuck with them

L:

Agentic tasks:

I run Unsloth's MiniMax-M2.7-UD-Q5_K_S on a dedicated headless Ubuntu node with Ryzen9 5950, 2xRTX3090, 128GB DDR4 with the following params:

CUDA_VISIBLE_DEVICES=0,1 ./llama-server \

-m /data/models/MiniMax-M2.7-UD-Q5_K_S/MiniMax-M2.7-UD-Q5_K_S-00001-of-00005.gguf \

--host 0.0.0.0 \

--port 8080 \

--alias MiniMax-M2.7-Q5KS-64k \

--ctx-size 65536 \

--parallel 1 \

--n-gpu-layers auto \

--n-cpu-moe 54 \

--split-mode layer \

--tensor-split 5,1 \

--cache-type-k q4_0 \

--cache-type-v q4_0 \

--batch-size 4096 \

--ubatch-size 2048 \

--threads 16 \

--threads-batch 16 \

--flash-attn auto \

--reasoning off \

--reasoning-format none \

--reasoning-budget 0 \

--mlock

Prefill / prompt speed: about 11.2 tok/s

Decode speed: about 8.5 tok/s

The GPUs are power limited to 280W.

[-]

youcloudsofdoom@reddit

S - 8GB

I'm getting great mileage out of Qwen3.5-35B-A3B-UD-Q4_K_L. With this I'm squeezing around 600 p/p and 30 t/s out of my RTX 4070 Laptop (!) edition, 8GB VRAM. Very usable, and the competency on single coding tasks has been very good so far. I'm currently experimenting with using this in a local Hermes set up, but it's early days yet.

Here's my llama.cpp settings, lots of back and forth on these...

-ngl 99 ^
  -fa on ^
  --n-cpu-moe 45 ^
  -c 198192 ^
  -np 1 ^
  -t 6 ^
  -tb 12 ^
  -b 4096 ^
  --ubatch-size 2048 ^
  --cpu-mask 0x555 ^
  --prio 3 ^
  --jinja ^
  --cache-type-k q8_0 ^
  --cache-type-v q8_0 ^
  --mlock ^-ngl 99 ^

https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/discussions/1

[-]

guiopen@reddit

There are still some problems with tool calls leaking in reasoning

[-]

truthputer@reddit

== Coding Only! - I only use LLMs for coding. Home workstation (built well before the RAM-pocalypse):

ASRockRack ROMED8-2T / AMD EPYC 7C13 64-core + 512GB RAM / Radeon RX 7900 XTX 24GB
Windows 11 Pro / llama.cpp, built with Vulkan support / OpenCode.

== Current main coding LLM: Gemma 4. It runs at high speed with a big context window - Benchmarks:

Gemma 4 26B-A4B - PP512 - 3171 t/s
Gemma 4 26B-A4B - TG128 - 132 t/s
Interactive prompt speed in apps: 100 t/s with 262144 context.

I launch with:

build\bin\Release\llama-server.exe -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL ^
    --threads 128 --ctx-size 262144 --cache-reuse 262144 ^
    --reasoning on --no-mmproj --parallel 1 --flash-attn on ^
    --temp 1.0 --top-p 0.95 --top-k 64

This hasn't been out long, but I've already noticed it sometimes makes mistakes and isn't as precise as frontier models like Claude. But it's fast and reasonably capable, just verify the work it does.

== I'm not using (until the bugs are fixed): Qwen 3.5 35B-A3B, I discovered the following problems:

Qwen 35B benchmarks at PP512 - 3148 t/s and TG128 - 144 t/s - but I was unable to get more than 40 t/s when running apps. Don't know why, changing settings makes no difference.
Prompt caching / reuse is currently broken with Qwen models in llama.cpp. In the middle of a chat, it will re-process the entire conversation before continuing which is frustrating.

== I am experimenting with: MiniMax 2.7 226B-A10B - Benchmarks:

MiniMax 2.7 226B-A10B - PP512 - 27 t/s
MiniMax 2.7 226B-A10B - TG128 - 5 t/s
Interactive prompt speed in apps: 8-10 t/s

I launch with:

build\bin\Release\llama-server.exe -hf unsloth/MiniMax-M2.7-GGUF:UD-Q4_K_XL ^
    --threads 128 --ctx-size 196608 --cache-reuse 196608 ^
    --reasoning on --no-mmproj --parallel 1 --flash-attn on ^
    --temp 1.0 --top-p 0.95 --top-k 40

[-]

Objective-Stranger99@reddit

GLM 4.7 Flash REAP 23B A3B UD Q5 K XS

[-]

the_auti@reddit

GLM 5.1 FP8 on sGLang 4xB300 Cluster 200k Context 32k Output

Have not finetuned this yet as this is an experiment and costs $30/hour to run.

Can run a dozen parallel agents at extreme speed.

Project: Convert 500k line Node / Express / Pug codebase to Go and React.

14 Hours Run Time

170k lines of Go 35k lines of TS

Output on par with Opus 4.6

Will run e2e testing tomorrow but the initial code review (can you call it that?...code glance) using Opus 4.6 is extremely positive.

Please note this setup is not for the faint of heart. It cost close to $500 just to get this running on runpod. It is still not running "Properly" but it was enough for our experiment.

In the coming weeks we will be testing MiniMax as well.

[-]

Travnewmatic@reddit

my daily for Hermes, agent-zero, and OpenCode:

[Unit]
Description=Llama.cpp GPU Docker Service serving Qwen
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker stop llama-server
ExecStartPre=-/usr/bin/docker rm llama-server
# Start the container with GPU support
ExecStart=/usr/bin/docker run --name llama-server \
  --gpus all \
  -p 8080:8080 \
  -v /var/lib/models:/root/.cache/huggingface/ \
  ghcr.io/ggml-org/llama.cpp:full-cuda \
  --server \
  --host 0.0.0.0 \
  --port 8080 \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
  --temp 0.5 \
  --top-p 0.95 \
  --min-p 0.0 \
  --top-k 20 \
  --timeout 3600 \
  --ctx-size 130000 \
  --fit on \
  --flash-attn on \
  --metrics

[Install]
WantedBy=multi-user.target

also running an embeddings model (on the CPU) for agent-zero:

[Unit]
Description=Llama.cpp Embedding Service
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker stop llama-embeddings
ExecStartPre=-/usr/bin/docker rm llama-embeddings
ExecStart=/usr/bin/docker run --name llama-embeddings \
  -p 8081:8081 \
  -v /var/lib/models:/root/.cache/huggingface \
  ghcr.io/ggml-org/llama.cpp:full-cuda \
  --server \
  --host 0.0.0.0 \
  --port 8081 \
  -hf jinaai/jina-embeddings-v5-text-small-retrieval-GGUF:Q8_0 \
  --batch-size 8192 \
  --ubatch-size 8192 \
  --embeddings \
  --n-gpu-layers 0 \
  --device none \
  --pooling mean \
  --metrics

[Install]
WantedBy=multi-user.target

been working fairly well. open to suggestions!

running a single Nvidia A10, 32G system memory. 9800X3D.

Eyelbee@reddit

Why are you running it on that setup when qwen 3.5 27B exists? Would be significantly higher quality.

[-]

false79@reddit

Had tool issue with cline --tui, I quit on qwen.

If you could share your 3090 setups, models you're using, and configurations I would truly appreciate it!

Thanks folks!

[-]

overand@reddit

I'm happy to help! I have a dual 3090 setup, but was on a single 3090 for a while.

Are you on Windows? Linux?
Are you sure the card works correctly?
What tool are you trying to use?
Ollama?
llama.cpp?
KoboldCPP?
vLLM?
LMStudio?
Something Else?

[-]

brenden77@reddit

Hey, thanks!

Linus (Ubuntu)
Yes. Small models seem to work fine, but obviously lack accuracy.
Ollama, but i'm open to suggestions.

[-]

overand@reddit

Well, you'll have better luck with "fine tuning" in llama.cpp rather than ollama. But, you'll either need to have docker working w/NVidia / CUDA, or you'll need to compile llama.cpp to make a version you can run. (It's not all that complex, but it can be daunting if you've never done it before, or don't have much CLI experience)

Anyway, get / install "nvtop" and run it to take a look to see if you have things already running and using GPU memory - that'll be a big factor in possible issues.

Once you're clear there: if you want to try this with ollama, my suggestion is:

ollama run hf.co/unsloth/Qwen3.6-27B-GGUF:Q4_K_M (or just "ollama pull ...") - though the llama.cpp way would be:

llama-cli -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M

Can you go into more detail of what you mean by "it's not working"

[-]

Jazzlike_Arm6363@reddit

hello, i have hp omen max , AMD Ryzen™ AI 9 HX 375 (up to 5.1 GHz max boost clock, 24 MB L3 cache, 12 cores, 24 threads)

AMD Ryzen™ AI (55 NPU TOPS)

GeForce RTX™ 5080 Laptop GPU (16 GB GDDR7 dedicated) 175 watt tgp

32 GB ram DDR5-5600 MT/s

can i run comfyui ? and which local llm is the best for my laptop ?
i,ll use it for everything even creative and rpg but mostly for work and study and learning and searching...etc you name it

[-]

hmmmmm_nl1@reddit

Comfyui is just a program, a scaffold to run models, so yes you can run that. Then it depends what model you are trying to run. You can download models straight from the software, including preset configs, watch a comfyui tutorial on youtube and you should be good to go.

As for general use, download lmstudio and use the model search option, that will give you every model on the planet (almost), and show you if it will fit on your gpu memory. See screenshot, on the left side of the download button you see 'full gpu offload possible'. Best way to start i recon, good luck!

[-]

Jazzlike_Arm6363@reddit

i downloaded lm studio and its easy to deal with , downladed qwen latest release 3.6 and it doesnt generate images , is there any model with high tokens that can generate images and uncensored ( not nsfw related )

[-]

TacticalGhosting@reddit

Looking to get into local models. already set up LM studio and connected it to Anything LLM.~~~~

Im looking for models that can run on my 8gb rtx 3070, 32gb ddr4, 5600x pc.

I'm looking for specialist models now. One dedicated to coding.

Then one dedicated to general intelligence.

One for creative storytelling.

All of them need.to be able to use tools. And hopefully all the can be almost or entirely inside the 8gb vram...

Especially the non coding ones. And hopefully can be used from ALLM as well.

[-]

hmmmmm_nl1@reddit

Just wanted to share my latest inference result for project 'Moriarty' using cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit on a watercooled RTX4090. Getting up to 850 tokens/sec with parallel agents using langgraph->vllm.

The app is build as a general research agent, taking 1 subject like: 'research the history of the Audi RS division' and then running with it for a set amount of time. It will generate sub-topics, and fan out up to 16 langgraph agents to a remote pc for the inference. These will also rate content quality and loop back to the planner. Starting with cheap search options like ddg en searxng, and upgrade to exa deep research to fill in gaps untill all subject are covered.

Locally its all running on a macbook m2, using a bunch of tools to scrape sites, frame analyze youtube videos using vsmol, read pdf files, whisper video and audio to text, translate, run embedding with ollama, save to vector db (lancedb) for semantic search etc. No real usecase yet, just having some vibe lols. The inference is done on a remote pc (windows 11+wsl2/ubuntu).

I've tried multiple models on the PC side (RTX4090), mainly Qwen 3.6 and Gemma 4 lately, both the dense models and MOE variants. In the end both were more then adequate, but MOE gave better speed, and Gemma was 10% faster then Qwen with similar results. Tried fitting different quants, some Q5 models fit as well, but gave no better output quality, and Q4 gave me more cache room, now running max 16 agents with 32K context (depending on amount of sub-topics agents are created). Some results from the logs, ran some random 5 minute benchmarks during research runs:

Benchmark Run #4 Analysis, running 15 agents:

Metric Minimum Maximum Average

Aggregate Speed (Total TPS) 628.0 843.4 756.0 tokens/s

Per-Agent Speed (TPS/Agent) 46.2 64.6 54.0 tokens/s

GPU KV Cache Usage 50.0% 70.8% 58.5%

Max Concurrency — 15 Agents —

Benchmark Run #5 Analysis

'In this run, we saw a lower concurrent load (8 agents instead of 15), which gave us a great look at how the server behaves when it's not fully saturated.'

Metric Minimum Maximum Average

Aggregate Speed (Total TPS) 257.9 659.2 414.2 tokens/s

Per-Agent Speed (TPS/Agent) 72.6 181.6 90.8 tokens/s

GPU KV Cache Usage 6.3% 32.5% 18.7%

Max Concurrency — 8 Agents —

'Observations: Lower Load, Higher Speed: With only 8 agents running, each individual agent received text nearly 2x faster (90.8 TPS avg) than during the 15-agent peak.'

Prompt Processing (Prefill) Performance

Metric Value

Peak Prompt Speed 5,335.4 tokens/s

Average Prompt Speed 1,500 - 2,400 tokens/s

Latency per Prompt. \~200ms - 500ms (for typical Moriarty instructions)

Ive been messing around with MTP aswell, and some turbo and rotorquant, but for my research usecase this multi agent setup (heavily relying on prefix cache) is giving me the best results. Maybe in the future i can add it on top, that might reach 1000TPS? :)

**vLLM flags:**
- `--max-model-len 32768` — 32K context for long article processing
- `--max-num-seqs 16` — match `SWARM_CONCURRENCY` in `.env` (16 concurrent agents)
- `--gpu-memory-utilization 0.93` — leaves headroom for WSL/Windows overhead
- `--kv-cache-dtype fp8` — ~50% KV cache compression, critical for 16 concurrent agents
- `--dtype half` — required for AWQ/Marlin compatibility
- `--enable-prefix-caching True` — optimizes repetitive system prompts across agents
- `--limit-mm-per-prompt image=0` — text-only, skips vision encoder, saves VRAM

Sidenote on the dflash-mlx model:
Im only using this at the last step, taking in massive context and forming a documentary script of sorts, using all articles and files to create 1 coherent story. Wanted to play around with dflash, but soon learned that would not work well on my system, ended up with the regular gemma-4-26B-A4B-it. This keeps both system on the same model, which makes things easier in the future. On an M2 Max chip this is running @ 100tps.

[-]

Series-Curious@reddit

OCR

[-]

archieve_@reddit

PaddleOCR

[-]

Disonantemus@reddit

OCR

Pure OCR Models supported by llama.cpp (preference order)

Model	Param
HunyuanOCR	0.5B
GLM-OCR	0.9B
PaddleOCR-VL	0.9B
dots.ocr	3B
LightOnOCR	1B
Deepseek-OCR	3B

General models with vision doing OCR (S range: 5GB VRAM):

Qwen3-VL-4B
Qwen3.5-4B

My tests prompts were:

convert to markdown: simple documents with tables and bullets
convert to mermaid: charts

Very good for a 31b and thankfully not so censored. Plus it has tuning potential.

Sadly the base is a bit dumber and it seems like context memory use is on the heavier side. Still, it's great to have a fun model for once.

[-]

ActivelyCoping@reddit

I’m new here so forgive me for the dumb question but I didn’t know local llms were censored, I thought that the whole point of running an llm yourself was to avoid just that. If the code is open source can you just “jailbreak” it, or would that be prohibitively hard somehow.

[-]

a_beautiful_rhind@reddit

The censorship isn't related to the code but how the model is trained. You can jailbreak it or sample around it, but the refusals are still in there. On API you get what you get.

Honestly same, Gemma 4 31B made me retire Skyfall 31b, Valkyrie 49b and Anubis 70b basically overnight. I was actively using all three for creative/RP up until I loaded Gemma 4 31b for the first time and just... never went back. The context coherence is what gets me most, it actually remembers what's been established and builds on it properly instead of treating every few exchanges like a soft reset. It referenced a setup/foreshadowing from like 50k tokens earlier that I had completely forgotten about myself and used it to strengthen a scene in a way that felt very intentional and earned. I have never seen a model at this size do that, only frontier models have been able to pull something like this off in my experience.

Genebra_Checklist@reddit

Gemma 4 26B A4B. I was working on a Gemma 3 fine tunning when Gemma 4 launched. Man, we can't even compare other models for creative writing. Works wonders in pipeline with few shots style exemples.

[-]

Caffdy@reddit

Works wonders in pipeline with few shots style exemples

can you explain this part? how do you setup the pipleline?

[-]

Traditional_Chart970@reddit

I want to try it with my new Macbook Pro 64GB ram. I can't try it my MBA M3 16GB :(

[-]

Top-Rub-4670@reddit

I can't try it my MBA M3 16GB :(

You can run Gemma 4 26B A4B Q4 just fine on that. It won't fit entirely in RAM but as a MoE they suffer a lot less than dense models when that happens.

I've personally run 26B on a machine with no GPU and only 16GB total RAM and still got 3tg/s, enough to play around with it.

It handles context well up to 128k tokens, great for reviewing medium scripts. Pair with DeepSeek-Coder-V2-Lite-Instruct 16B Q3 (M category) if you have \~10GB for tougher tasks.

Sources: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct, official Ollama library benchmarks.**Coding/Programming (S: \<8GB VRAM)**For lightweight coding assistance on low-end hardware, I recommend Qwen2.5-Coder-7B-Instruct Q4_K_M (\~4GB VRAM). It's excellent for code completion, debugging, and explanations in languages like Python, JS, C++. Benchmarks show it outperforming similarly sized models in HumanEval (pass@1 \~65%) and MultiPL-E.Example usage with Ollama: `ollama run qwen2.5-coder:7b-instruct-q4_K_M`It handles context well up to 128k tokens, great for reviewing medium scripts. Pair with DeepSeek-Coder-V2-Lite-Instruct 16B Q3 (M category) if you have \~10GB for tougher tasks.Sources: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct, official Ollama library benchmarks.**Coding/Programming (S: \<8GB VRAM)**

For lightweight coding assistance on low-end hardware, I recommend Qwen2.5-Coder-7B-Instruct Q4_K_M (\~4GB VRAM). It's excellent for code completion, debugging, and explanations in languages like Python, JS, C++. Benchmarks show it outperforming similarly sized models in HumanEval (pass@1 \~65%) and MultiPL-E.

Example usage with Ollama: `ollama run qwen2.5-coder:7b-instruct-q4_K_M`

It handles context well up to 128k tokens, great for reviewing medium scripts. Pair with DeepSeek-Coder-V2-Lite-Instruct 16B Q3 (M category) if you have \~10GB for tougher tasks.

Gemma 4 E4B on Mac Mini M4 (16GB) - Benchmarks (oMLX vs Unsloth)

I've been benchmarking the Gemma 4 E4B models on a Mac Mini M4 (16GB) to find the optimal configuration for coding and technical assistance. The following results compare the standard oMLX quants against the Unsloth UD-MLX (Dynamic) versions using the oMLX engine with Paged SSD KV caching.

Performance Comparison (Generation TPS):

Context	oMLX 4-bit (Std)	Unsloth 4-bit (UD)	oMLX 8-bit (Std)	Unsloth 8-bit
1k	31.2	26.1	18.8	18.7
16k	24.6	21.2	14.9	14.9
32k	20.8	16.2	9.1	8.8

Technical Observations:

The oMLX Standard 4-bit is the most efficient choice for a daily driver. It maintains over 20 TPS at 32k context with a minimal memory footprint (\~4.5GB), allowing the system to handle other heavy processes without lag.

The Unsloth UD-MLX 4-bit offers better logical reasoning and native vision support, though it carries a 20% performance penalty. It is the preferred model for vision-centric tasks or complex debugging where precision is prioritized over speed.

Regarding the 8-bit versions (both oMLX and Unsloth), they perform nearly identically. However, on 16GB hardware, they hit a hard limit at high context. As soon as oMLX begins aggressive SSD paging at 32k, speed drops to \~9 TPS, making 4-bit the only practical option for long-context workflows on this machine.

In summary: Use oMLX 4-bit Standard for speed and general coding; switch to Unsloth 4-bit UD for vision and high-level reasoning.

[-]

Trick-Assignment-828@reddit

How are you runing unsloth?

[-]

MerePotato@reddit

I highly recommend against 4 bit quants for a model this small and knowledge dense

[-]

gandhi_theft@reddit

What t and top_p do you like with this model for coding?

[-]

__ahdw@reddit

[-]

Top-Rub-4670@reddit

I can confirm that Gemma 4 is very good in Finnish and Swedish in my tests. It's also good at French. This is less impressive, but what is impressive is that it knows regional dialects, which is usually entirely lost in other models (you get neutral Parisian French with a hint of mimicry of the dialect you specifed, if you're lucky).

[-]

WhoRoger@reddit

S for pool people

Still loving Smollm3 3B, I think it's overall the best small model to play with
Granite 4 7B and 1B, underappreciated. 1B can code up some basic things better than Qwen 3.5 2B, and 7B is MoE so much faster than anything at that size
Qwen 3.5 4B is incredible for vision
LFM 2.5 1.2 Thinking, a tiny reasoning brain
InternVL3.5 1B, a tiny competent vision model
Gemma4 E4B shows how far have small models come

[-]

OrganicHalfwit@reddit

3060ti 8Gb with 32gb system Ram.

Been using Qwen3.5-35B-A3B-Q4 for large text chains with multiple files and comparison. Got upto a context length of 110k, but it's quality was dropping significantly. \~30t/s

For small and fast questions about the models themselves I'm using Qwen3.5-9B-Q4 \~20t/s

I am currently running on ollama with jan.ai ontop so i'm trying to move over to llama.cpp with webui.
All very new to this though and want to get into image, audio, and video gen.

[-]

salmon37@reddit

Hey, how do you share gpu ram with system ram? I'm only getting into local llms and I've la 3080 with 10gb ram and I've been able to run 2 bit quantized models with llama.cpp, but I didn't know you could use regular ram with these models

[-]

OrganicHalfwit@reddit

So from my understanding is that only MoE models can share between ram. Take the Qwen3.5-35B-A3B, its not actually a 35B parameter model, its 7 different 3B parameter models which are all honed for multiple different tasks. MoE (Mixture of experts) allows the models that aren't being used to wait in system ram while the singular 3B model which is most relevant for the task swaps in and out.

So effectively you have multiple different brains, which are all pretty smart, waiting in the sideline to sub in for one an other. This lowers total speed on call (but tokens generate still quite fast), it also means that there's alot of excess room in your vram so your context length can be much larger.

However, you are still using 3B parameter models so they can be fair stupid. On benchmarks the 9B basically always beats the MoE, but because its so large I can only use that with a small context window of 4k tokens

with 10gb of vram you have just a little bit more wiggle room than I do so you can play around a bit more. Although for huge conversations (100k +) i think the MoE's are good enough.

As to "how to use", my front end jan.ai does automatic allocation, but it's suboptimal so I have to play around a bit. Specifically enabling "keep all experts in CPU" and I have -1 on offloading model layers to GPU, which honestly I cant remember why i did that as it's kinda counter intuitive.

I'm still learning too at this point. Anyway, hope this helped!

[-]

salmon37@reddit

Helped a bunch, thanks so much for the detailed response!

[-]

Nabushika@reddit

gguf files (/llama.cpp) is designed to be able to split computation between CPU and GPU (although I've heard ik_llama might be faster). Any model you download will spill over into system ram if you don't have enough vram (with appropriate slowdown).

[-]

Objective-Stranger99@reddit

Qwen3.5 35B A3B UD IQ4 XS for deeper thinking.

Gemma 4 26B A4B UD Q4 K XL for better conversational flow.

[-]

Hydroskeletal@reddit

Research and ingestion projects is where I'm working local; for coding I've not seen enough to pull me away from Claude/Codex. But when you're dealing with a flood of data, way too easy to blow your token budget.

A couple of M4 Macs for me and I'm all in on Gemma4 right now, 31B and E4B. Qwen3.5-35b-a3b was just the hands down winner but I kept coming back to Gemma and it depends on what you want to put in.

If you hand both Q35-27b and g4-31b a book and say "Write me a detailed book report", Qwen is going to give you the better, longer report. By default Gemma is lazy. You need to tell it to not spare the thinking budget, make sure you're giving it max tokens for output and really tell it all the things you expect in the book report. Then you take the detailed prompt and give it to both and Gemma will have more details in a more concise format. Qwen will repeat the same ideas phrased differently.

Same thing goes for planning. If you tell Qwen "Give me a plan to do X", Qwen does a better job of intuiting what you want. But be specific about what the plan needs to do, the metrics, goals, outcomes, things to account for etc. and Gemma is better.

Where Gemma absolutely crushes Qwen for me though is my source discrimination tests. Qwen is eager to include crap and then hedge that it might be crap, or is perhaps more "crap curious" meaning it will look at something with a crappy abstract and then after wasting time decide it was in fact crap.

So the workflow is using the big dense Gemma as the 'brain', doing the big data work and then delegating out to a small model for very constrained tasks ("Which of these 3 documents meet criteria X?") and e4b really does quite well at this. I was using gpt-oss20b before and e4b is just strictly better. Caveat is that you really need to use thinking. I tried q3.5-9b but it was often slow enough that it didn't make sense. I should probably do more testing for q3.5 at the 4b size.

[-]

CatEatsDogs@reddit

Using three llms through my telegram bots -->n8n instance: 1. Parakeet (hope I correctly typed it) if I want to ask something in voice 2. qwen3.5:35b-a3b-q4_K_M if input was text or recognised speech 3. gemma3:27b-it-qat if input contains image

2 and 3 are running on the server in ollama using 12gb rtx3080. Parakeet is running on separate lenovo 720q fully on cpu.

Qwen is mostly used for translating something into my native language. Gemma is used the same way but with images.

I tested image processing in qwen by I didnt like it. Gemma is "smarter". I tried to post random screenshots from random youtube dron flies and gemma recognized more places successfully.

Also tried to use newest gemma4 26b but struggle to disable thinking in n8n.

[-]

Farmadupe@reddit

I've got a 3090 + 2060 at home, that's enough for models in the 30B range at \~q5 with llama.cpp

* For general agentic work, qwen3.5-27b at \~Q5 is just about on the right side of competent, with some handholding on MCPs that its given. But with a small set of tools, and the right carefully crated prompts, it can do useful stuff independently.
* For batched logging/classification, I switch to the smallest qwen3.5 possible that I can use. qwen3.5-9b and below fit entirely in the 3090 at fp8 so can run under vllm, which is way way faster and less buggy.
* for some tasks, I can switch to 122b or 397b, but they're orders of magntiude slower so don't get used much.
* qwen3.5 has SOTA-level image comprehension. there's no need to pay money for image classification tasks.
* gemma4 31b is roughly comparable to qwen3.5-27b but not quite as good. The only task that it really beats qwen3.5 is video comprehension. I can stuff 100k tokens into context and get better groundings than qwen3.5 series. The default persona is a bit more pliable than qwen3.5, which can be a bit robotic.

* honestly, at 32Gb vram, I don't think I can replace opus/codex for agentic coding yet. I make my paycheck of coding and qwen3.5-27b is too slow and not brainy enough for coding tasks.

[-]

Longjumping-Move-455@reddit

If you didn’t get an answer qwen3.6 35b using llama.cpp :). I suggest using unsloths gguf. https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

[-]

muyuu@reddit

May thread when?

[-]

rm-rf-rm@reddit (OP)

Speciality

(includes medical, legal, accounting, math etc.)

[-]

Disonantemus@reddit

OCR

HunyuanOCR

Supported by llama.cpp
Small/Fast
Accurate

Use case: convert documents with tables or charts to markdown/mermaid, that is hard for traditional OCR (Tesseract).

[-]

UpsetEmotion6660@reddit

Edge AI / IoT inference on constrained devices:

S (<8GB): For always-on sensor inference on MCU-class hardware (STM32N6, ESP32-S3, Syntiant NDP120), you're not running LLMs — you're running quantized TinyML models (keyword spotting, anomaly detection, vibration classification). TensorFlow Lite Micro and ST's NanoEdge AI Studio are the practical tools here.

M (8-32GB): This is where edge inference gets interesting. Running quantized 4B models on Jetson Orin Nano or RPi5 + AI HAT for real-time vision, predictive maintenance, or local NLP. Gemma4 e4b quantized is genuinely usable for on-device agentic tasks in industrial IoT — local decision-making without cloud round trips.

The underappreciated angle for this community: the biggest constraint for edge AI isn't model quality anymore — it's the orchestration layer. How do you push model updates OTA to a fleet of thousands of devices running different hardware? How do you handle inference when connectivity is intermittent? The model is the easy part; the distributed systems around it (connectivity management, fleet OTA, telemetry collection) are where most deployments actually struggle.

Tyrannas@reddit

Sure so I downloaded the gguf model and served it locally with llama.cpp llama-server, then I use this basic snippet:

from pathlib import Path
from churro_ocr.ocr import OCRClient


from churro_ocr.providers import OCRBackendSpec, build_ocr_backend, LiteLLMTransportConfig


backend = build_ocr_backend(
    OCRBackendSpec(
        provider="openai-compatible",
        model="local-model",
        transport=LiteLLMTransportConfig(
            api_base="http://127.0.0.1:8080/v1",
            api_key="dummy",
        ),
        profile="stanford-oval/churro-3B"
    )
)


image_path = "./images/test.png"
page = OCRClient(backend).ocr_image(image_path=image_path)from pathlib import Path
from churro_ocr.ocr import OCRClient


from churro_ocr.providers import OCRBackendSpec, build_ocr_backend, LiteLLMTransportConfig


backend = build_ocr_backend(
    OCRBackendSpec(
        provider="openai-compatible",
        model="local-model",
        transport=LiteLLMTransportConfig(
            api_base="http://127.0.0.1:8080/v1",
            api_key="dummy",
        ),
        profile="stanford-oval/churro-3B"
    )
)


image_path = "./images/test.png"
page = OCRClient(backend).ocr_image(image_path=image_path)
```

```
I use the stanford-oval profile to get structured XML output but you can also have raw text by changing the profile.

[-]

MuDotGen@reddit

How does this do for modern official documents? Languages? Such as Japanese? I'd been using PaddleOCR for that use-case so far, but a 3B Q4 model seems tempting.

[-]

Tyrannas@reddit

Haven't tried on modern, but I'm pretty sure you can find better since Churro is a Qwen2.5 fined tunes on 100k historical documents. Maybe look on https://huggingface.co/collections/ggml-org/ocr-models to find other options ?

[-]

CodeCatto@reddit

What are the best coding models to run on a 12GB RTX 5070Ti?

[-]

tthompson5@reddit

I have a RTX 4070 Ti (also 12GB). I'm NOT a coder, but people on here say Gemma-4-26b codes well, and I got the UD-IQ4_XS quant from unsloth to run on my machine (https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF) at about 40t/s on startup (slower for longer contexts). It's probably worth a try for you.

I also successfully got it to write a couple of bash scripts for me (not big coding projects) and refactor a simple R script. From my experience with it, it seems reasonably competent. I use llama.cpp to serve the model.

Tuning it for speed versus hogging all the system RAM is still ongoing, but if you want it, I can share my full current start-up script for it. A lot of experimentation and trail-and-error has gone into it, and I'm still experimenting with tuning it. I'm not using the mmproj file to enable vision by default. Unless you need the vision for your use case, it's better to leave it off and save yourself the VRAM/RAM.

Anyway, these are the important flags I'm currently using, and I hope they'll give you a good starting point if you decide to try the model.
--ctx-size 100000
--parallel 1
--cache-type-k "q8_0"
--cache-type-v "q8_0"
--flash-attn on
--swa-checkpoints 2
--cache-ram 2048
--checkpoint-every-n-tokens 16384
--defrag-thold 0.1
--temp 1.0
--top-k 64
--top-p 0.95
--min-p 0.02
--repeat-penalty 1.1
--no-mmap
--cpu-moe 12
--jinja
--chat-template-kwargs '{"enable_thinking":true}'

Screenshot from running it on my machine with the car wash test question:

[-]

mr_357@reddit

I tried running gemma4-26b-Q4-K-M on my 5070 and while it's amazing for creative tasks and decently fast, it really struggles for coding if provided tools (like running it through copilot, cline or other vscode extensions). I had a lot more luck with mistral's devstral, but obviously it's much much slower because it doesn't fit in the VRAM

[-]

tthompson5@reddit

Fwiw, I did write my comment above before Qwen3.6 came out (either version), which is also worth trying.

I'm still having decent luck with Gemma-4 writing me scripts and such (including debugging them). Last night I had it write me a script to send a bunch of battlemap images (one at a time) to llama-server and get back descriptions and tags and save them as properly titled json files. All that said though, I have heard more from some redditors that say Gemma-4 struggles with a true codebase, which may be a result of how its attention window works.

I don't know if you care to tinker more with Gemma-4, but if you do, the jinja chat template is still broken for tool calls even though Google updated it again just a few days ago. You can try the fix detailed here: https://www.reddit.com/r/LocalLLaMA/comments/1syps6i/i_stumbled_on_a_gemma_4_chat_template_bug_for/ After I started using that template, the number of failed tool calls from Gemma dropped noticeably.

[-]

mr_357@reddit

yeah I just tried out qwen 3.6-35b-a3b and it's pretty good, it doesn't always make correct code on the first try, but with some guidance it can get stuff done much faster than me typing it

I'll try out the fixed chat template for gemma, but other than the tool calls it also sometimes starts looping in on itself and doesn't seem to have good training data for stuff I want to use it for (game dev)

[-]

Sergei-_@reddit

hi, are is this app are you using to chat?

[-]

tthompson5@reddit

You mean where did I get the screenshot? It's part of llama-server. I can talk to the AI by opening my web browser and going to localhost port 8080 (or whatever you set when you launch the server).

Right now, I'm mostly using Mistral's Vibe to actually work with the AI, but there are better harnesses depending on what you want to do.

[-]

PairOfRussels@reddit

I've been running llama.cpp with qwen 3.5 (now 3.6) 35B A3B model. I started with a context size that I need (70K context size for example) put all the layers on GPU, then put as many MOE experts on CPU/DRAM until I have all the model and context fitting in the 10GB VRAM (and none in the 24GB shared VRAM.. because as soon as I share between VRAM and Shared VRAM aka DRAM it slows to PCIE transfer speed).

This gets me about 100t/s prompt eval and 30t/s token generation.

Is there a better model and start params to use for a 3080 RTX to do agentic coding with Cline?

[-]

quickreactor@reddit

10GB what quant are you using to fit it all in 10GB?

[-]

AreaExact7824@reddit

Best agent for subagent

[-]

Party-Log-1084@reddit

Building a completely local, uncensored RAG setup for sysadmin tasks (ingesting logs, docker-compose, PDFs as strict source of truth).

Specs: Linux Mint 22 | Ryzen 9 5950X | 64GB RAM | RX 7800 XT (16GB VRAM)

Need advice on optimizing this AMD stack:

What’s the most performant route for RDNA3 right now? Ollama+ROCm or compiling llama.cpp with ROCm directly?
Is OpenWebUI the standard for reliable document retrieval, or is there a better, specialized UI for technical files?
Best practices for offloading layers (16GB VRAM -> 64GB System RAM) on AMD without crippling prompt evaluation speed?

[-]

overcompensk8@reddit

Out of VRAM - check nvidia-smi

[-]

Constant-Bonus-7168@reddit

Appreciate this — the community benefits from thoughtful posts like this.

[-]

Practical-Charge8321@reddit

I was pretty happy with qwen 3 8B from a quality perspective, but it's pretty slow on my tiny 8GB VRAM rig

[-]

vex_humanssucks@reddit

Good list. One I'd throw in the "solid mid-tier" bucket that doesn't get enough mentions: running Qwen3.6-27B at Q4 on 24GB VRAM is genuinely usable for agentic tasks where you need speed over max quality. The quantized version holds up surprisingly well on structured output generation compared to the full weight.

[-]

Artanisx@reddit

Need recommendations for Coding :) M/L = 24 gb VRAM and/or 64gb RAM

everything better is quite large, how much ram do you have?

(also, loom mentioned, which do you use? i've been using tapestry-loom)

[-]

rileyphone@reddit

Trying Kimi Linear Base now. Mostly working on my own loom projects, right now one in a phone form factor.

[-]

Ok-Internal9317@reddit

hmm you seems to be outdated, i havent heard of llama3.1 since a long time

[-]

HopePupal@reddit

i'm not using them, but there are base models available for Gemma 4, Nemotron 3, the smaller versions of Qwen 3.5 (including 35B-A3B but not 27B)

[-]

pmttyji@reddit

I'm back to using Llama 3.1 8b local but there has to be something better that isn't annealed to death.

https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K - u/FizzarolliAI

[-]

officialAdfs_m0vie@reddit

General usage, Medium (I have 16gigs of vram)

[-]

Joozio@reddit

For coding specifically the local vs frontier gap narrowed a lot. Ran Aider with a few local models alongside Claude Code and Codex CLI - the harness configuration made more difference than model choice at the margin. Not sure this generalizes but for structured coding tasks local at 70B+ is surprisingly close.

[-]

Skid_gates_99@reddit

Qwen3.5-27B on a single 3090 for most of my agentic work. bartowski Q6_K quant, 64k context, thinking off for tool calls because it wastes tokens reasoning about which function to invoke when the schema already tells it everything it needs to know. Gets me around 20 t/s on generation which is fine for agent loops where the bottleneck is the tool execution anyway.

Tried Gemma 4 26B for a week and went back. Quality is genuinely good when it works but the crashes and the tool call formatting issues killed my trust. I need something I can leave running overnight on a multi step workflow without babysitting it. Qwen has been boring and reliable for that which is exactly what I want.

I’ve been testing the Gemma 4 / Qwen hybrid-attention costs as well. The thermal throttling is the real bottleneck. I actually managed to stabilize the flow using a deterministic routing logic (Dirichlet-Shift) that cuts redundant cycles. Verified a 16.8% energy recovery via ZKP, which keeps the clocks higher for longer. I’ve put the skeleton on GitHub if you want to see how to bypass the standard JAX-level friction: https://github.com/BerzeShift/Berze-Shift

Essentially, there is no reason AI should waste energy when 100% of the time in 1 million simulations that energy is useless. Like a human counting floor boards in every room it walks into or putting on a life jacket when they enter their 20th floor office to be safer.

[-]

different issues than GPT or Claude on the

same codebase — each model has systematic

blind spots the others don't share.

Has anyone done structured comparisons of

per-model accuracy on specific task types?

[-]

FlightCautious3748@reddit

minimax m2.7 has been the most useful for client work lately, team was skeptical but the throughput on longer context tasks is actually solid for the cost of running it locally

[-]

__Captain_Autismo__@reddit

Startup founder - 96gb vram ( rtx 6000 pro )

Coding, writing and web dev.

General purpose: Minimax 2.5 reap q4

Web dev: Gemma 4 31b it bf-16

Full tool use through both on my built from scratch agent harness. Manage workflows through my control surface.

Around 80-90% or more of my ai usage is now local and the workflows get better daily.

Size category: S (under 8GB VRAM). Making it work at this tier requires being very deliberate about quantization format and memory management.

[-]

Farmadupe@reddit

I've got a 3090 + 2060 at home, that's enough for models in the 30B range at \~q5 with llama.cpp

* For general agentic work, qwen3.5-27b at \~Q5 is just about on the right side of competent, with some handholding on MCPs that its given. But with a small set of tools, and the right carefully crated prompts, it can do useful stuff independently.
* For batched logging/classification, I switch to the smallest qwen3.5 possible that I can use. qwen3.5-9b and below fit entirely in the 3090 at fp8 so can run under vllm, which is way way faster and less buggy.
* for some tasks, I can switch to 122b or 397b, but they're orders of magntiude slower so don't get used much.
* qwen3.5 has SOTA-level image comprehension. there's no need to pay money for image classification tasks.
* gemma4 31b is roughly comparable to qwen3.5-27b but not quite as good. The only task that it really beats qwen3.5 is video comprehension. I can stuff 100k tokens into context and get better groundings than qwen3.5 series. The default persona is a bit more pliable than qwen3.5, which can be a bit robotic.

* honestly, at 32Gb vram, I don't think I can replace opus/codex for agentic coding yet. I make my paycheck of coding and qwen3.5-27b is too slow and not brainy enough for coding tasks.