PaMRxR

Can't get over 250TPS on RTX5090 with Qwen3.5-4B

Posted by luckyj@reddit | LocalLLaMA | View on Reddit | 30 comments

PaMRxR@reddit

Maybe this post will be interesting for you. It's for datacenter GPUs but it goes into a lot of details and I found it generally educational. https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

> My main goal is to offload as much as possible to the CPU/DDR. So, no hardware acceleration where ever possible (all Foot terminals combined use 10MB -20MB). I'm still tweaking. But you might want to consider going a similar route, as every bit of additional VRAM is vital!

The GUIs on my Ubuntu 24.04 were taking up a few hundred MB, which I didn't think was worth bothering to optimize.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

Indeed sounds extremely similar hw-wise. Meanwhile I moved the 3090s to an older dual Xeon server I already had, with 2x PCIe 3.0 x16 slots running headless. This cuts the max bandwidth in half, but also bumps the min bandwidth by 4x. In the end it's a bit faster for -sm tensor with llama.cpp, nothing dramatic though.
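
To make the "half the max, 4x the min" arithmetic concrete, here is a rough sketch. The per-lane figures are the usual approximate PCIe numbers after encoding overhead. The old board's slot layout (one PCIe 4.0 x16 plus one PCIe 3.0 x4) is an assumption that happens to be consistent with those ratios, not something stated in this comment.

```python
# Approximate usable bandwidth per PCIe lane in GB/s (one direction).
GBPS_PER_LANE = {3: 0.985, 4: 1.969}


def link_bandwidth(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    return GBPS_PER_LANE[gen] * lanes


# ASSUMPTION: the previous board had one PCIe 4.0 x16 and one PCIe 3.0 x4 slot.
old = [link_bandwidth(4, 16), link_bandwidth(3, 4)]
# From the comment: the Xeon server has two PCIe 3.0 x16 slots.
new = [link_bandwidth(3, 16), link_bandwidth(3, 16)]

print(f"old: max {max(old):.1f} GB/s, min {min(old):.1f} GB/s")
print(f"new: max {max(new):.1f} GB/s, min {min(new):.1f} GB/s")
print(f"max changes by {max(new) / max(old):.2f}x, min by {min(new) / min(old):.2f}x")
```

These are theoretical link rates; real transfers between the cards will come in lower.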

I'm done with using local LLMs for coding

Posted by dtdisapointingresult@reddit | LocalLLaMA | View on Reddit | 810 comments

PaMRxR@reddit

Local models require a significant time investment to learn the details of how things work and how to make efficient use of the hardware and model capabilities. Without some curiosity driving them, people like the OP, who just want to use something and don't really care about the details, will fail.

I'm done with using local LLMs for coding

Posted by dtdisapointingresult@reddit | LocalLLaMA | View on Reddit | 810 comments

PaMRxR@reddit

Qwen3.6-27B Q8 and KV-cache BF16 is working very well for me with the pi-coding-agent on 2x3090 GPUs. But even with 1x3090 before I've had pretty good success with: Qwen3.5-27B, and before Devstral-Small-2 24B and Qwen3-Coder-Next, before Qwen3-32B, and so on. Maybe I just haven't been spoiled by the cloud models? I've only ever tried Kimi (2.5 I think) with a 1 week free trial. My local models occasionally stumble due to lack of some obscure knowledge, but pasting some doc into the context is really not that hard.

Every time a new model comes out, the old one is obsolete of course

Posted by FullChampionship7564@reddit | LocalLLaMA | View on Reddit | 198 comments

Qwen 3.6 35B crushes Gemma 4 26B on my tests

Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 116 comments

PaMRxR@reddit

In your llama-swap config add something like this after the cmd for example. Then "<model-id>:instruct" will also be available without any swapping.

    filters:
      setParams:
        temperature: 0.6
        top_p: 0.95
        top_k: 20
        min_p: 0.0
        presence_penalty: 0.0
        repetition_penalty: 1.0
      setParamsByID:
        "${MODEL_ID}:instruct":
          temperature: 0.7
          top_p: 0.8
          top_k: 20
          min_p: 0.0
          repetition_penalty: 1.0
          chat_template_kwargs:
            enable_thinking: false

Guys we have to change the pelican test

Posted by Tall-Ad-7742@reddit | LocalLLaMA | View on Reddit | 93 comments

PaMRxR@reddit

2 tries with Qwen3.5-35B-A3B, no amount of prompting can get it to make something coherent :| https://preview.redd.it/eyx1utlklevg1.png?width=782&format=png&auto=webp&s=3709b65de66e8b30e425129133cc99bcd70ea94f

Guys we have to change the pelican test

Posted by Tall-Ad-7742@reddit | LocalLLaMA | View on Reddit | 93 comments

PaMRxR@reddit

Qwen3.5-27B Q8 below. https://preview.redd.it/sqwi91tekevg1.png?width=728&format=png&auto=webp&s=b9617b4ae81668e81dd49a5c5d99b70577351b66

Updated Qwen3.5-9B Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 106 comments

PaMRxR@reddit

> mradermacher's i1 quants are punching way above their weight

I really wonder what they are doing! So far I was convinced ByteShape are unbeatable per-byte, but it doesn't quite seem so with these results.

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Posted by PerceptionGrouchy187@reddit | LocalLLaMA | View on Reddit | 117 comments

PaMRxR@reddit

With multi-gpu definitely add an option like this: --device-draft CUDA0, otherwise it was pretty much same as baseline for me. With that tg went from 23 -> 36 for me with [IQ2_M](https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/blob/main/google_gemma-4-E2B-it-IQ2_M.gguf) (and 34 with UD-IQ2_M)
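
For comparison with the percentages in the thread title, the relative gain from the numbers above can be worked out like this. A small sketch, assuming the 34 tg figure for UD-IQ2_M is measured against the same baseline of 23:

```python
def speedup_pct(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new / baseline - 1.0) * 100.0


baseline_tg = 23.0  # tokens/s without the draft model pinned to one GPU
drafts = [("IQ2_M draft", 36.0), ("UD-IQ2_M draft", 34.0)]

for label, tg in drafts:
    print(f"{label}: {speedup_pct(baseline_tg, tg):+.1f}%")
```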

Weekend project with Intel B70s

Posted by dev_is_active@reddit | LocalLLaMA | View on Reddit | 41 comments

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

I just rebuilt it to try -sm tensor, but it keeps crashing as soon as it gets done with prompt processing unfortunately. Probably needs some time for issues to be ironed out.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

I meant slower compared to something like tensor parallel, which utilizes the GPUs much better. Otherwise, I'd guess -sm layer is slower than a single 3090 mainly when you load a larger quant.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

So the bandwidth affects pp too, although a little less than tg (17% vs 26%). I just tried ik_llama with -sm graph and getting 35.4 tg, pretty much same as you! But pp tanks from 1950 to 770 (-b/-ub 512), or 870 (-b/-ub 4096) for me. Do you have the startup command? Maybe I'm missing/misusing some parameter. Here's what I used:

    ${ik-llama-server}
      --parallel 1
      --model "${models_path}/Qwen3.5-27B-GGUF/Qwen3.5-27B-Q8_0-11.gguf"
      --mmproj "${models_path}/Qwen3.5-27B-GGUF/mmproj/mmproj-BF16.gguf"
      --chat-template-file "${models_path}/chat_templates/Qwen3.5.txt"
      --split-mode graph
      --fit --fit-margin 512
      --seed 42
      --ctx-size 100000
      --n-gpu-layers 999
      --jinja --peg
      -fa on
      --no-context-shift
      -b 2048 -ub 2048
      --cache-ram 30000

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

It's difficult to find quantizations that are well optimized for 2x3090s (48GB VRAM + Q8/BF16 native support). I really think more people can benefit by tuning models specifically for their systems. With --split-mode layer the slow transfer doesn't really matter as far as I know, but the GPUs are only utilized like 50% so I think tg is at least 2x slower than it could be.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

Thanks for sharing mate, it looks very similar to my numbers. My 27B quant is 31GB and I can fit 200k context. I don't power limit which maybe explains a little faster tg of 25. Have you tried --split-mode row in llama.cpp, or maybe vLLM with tensor parallel=2?
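
As a sanity check on that tg of 25, here is a back-of-the-envelope ceiling, assuming decoding is purely memory-bandwidth bound and every token reads all weights once. The 936 GB/s figure is the RTX 3090 spec-sheet bandwidth; the function name and the "ideal tensor parallel" case are illustrative assumptions, and KV cache reads and inter-GPU communication are ignored.

```python
RTX_3090_BANDWIDTH_GBS = 936.0  # spec-sheet memory bandwidth of one RTX 3090


def tg_upper_bound(weights_gb: float, bandwidth_gbs: float, parallel_gpus: int = 1) -> float:
    """Rough tokens/s ceiling for a dense model when decoding is memory-bound.

    With layer split the GPUs take turns, so the effective bandwidth is that
    of a single card (parallel_gpus=1). With ideal tensor parallelism each
    card reads its own share of the weights at the same time.
    """
    return bandwidth_gbs * parallel_gpus / weights_gb


weights_gb = 31.0   # from the comment: the 27B quant is 31GB
measured_tg = 25.0  # from the comment

layer_split = tg_upper_bound(weights_gb, RTX_3090_BANDWIDTH_GBS)
tensor_parallel = tg_upper_bound(weights_gb, RTX_3090_BANDWIDTH_GBS, parallel_gpus=2)

print(f"layer split ceiling:     {layer_split:.1f} t/s")
print(f"ideal tensor parallel:   {tensor_parallel:.1f} t/s")
print(f"measured vs layer split: {measured_tg / layer_split:.0%}")
```

By this estimate the measured number is already close to what layer split allows, so any further gain would have to come from running both cards at once.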

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

Qwen3.5 122B does not fit fully of course, the quant I use is 72 GB, so ~36GB experts run in system RAM. I find it generally a bit dumber than Qwen 3.5 27B anyway, but occasionally it comes ahead in debugging tasks where the little more knowledge of more obscure details helps. Actually I run a VERY similar system of Qwen3.5 27B agent + Qwen 3.5 35B-A3B subagent. But I run each fully on both cards at a Q8_0+, swapping them back and forth with llama-swap. Have you tried such an arrangement? It's slower with the swapping, pp is seriously faster, and finally tg is slower due to larger weights I think. Whether it has a significant impact on quality is hard to tell though. Do you find the Gemma4 combo better than Qwen3.5 btw?
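
The memory split for the 122B model works out roughly as below. The quant size and the ~36GB in system RAM come from the comment above; reading the remainder of the VRAM as space for KV cache and compute buffers is an inference, not something measured.

```python
total_vram_gb = 2 * 24   # two RTX 3090s
quant_gb = 72            # from the comment: size of the quant in use
experts_in_ram_gb = 36   # from the comment: expert weights kept in system RAM

# Whatever is not offloaded to RAM has to live in VRAM.
weights_in_vram_gb = quant_gb - experts_in_ram_gb

# ASSUMPTION: the rest of the VRAM goes to KV cache, compute buffers, etc.
vram_left_gb = total_vram_gb - weights_in_vram_gb

print(f"weights in VRAM: {weights_in_vram_gb} GB")
print(f"VRAM left over:  {vram_left_gb} GB")
```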

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

PaMRxR@reddit (OP)

Just yesterday I actually tried compiling that P2P driver patch and ran into various issues, both compilation and linking errors. I don't expect miracles from it anyway, since the PCIe 3 slot is physically x4 at most, so about 4GB/s.

96GB Vram. What to run in 2026?

Posted by inthesearchof@reddit | LocalLLaMA | View on Reddit | 88 comments

PaMRxR@reddit

Give it a try with ik_llama.cpp, I use it specifically for Qwen3.5 122B-A10B with 2x24GB in VRAM + 35GB in RAM, which in scale I think is kinda similar to what you are trying. I'm getting 1000 pp/s, a lot better than llama.cpp.

ASUS X99-E WS with 2x 3090. Anyone was able to set it up?

Posted by novanet-central@reddit | LocalLLaMA | View on Reddit | 6 comments

PaMRxR@reddit

I'm curious what inference performance you manage to get with your setup. I just posted about my experience with 2x 3090, but worse motherboard.

Qwen3.5-35B-A3B Q4 Performance on Intel Arc B60?

Posted by LeDynamique@reddit | LocalLLaMA | View on Reddit | 5 comments

PaMRxR@reddit

Sounds like something is not right with the implementation or configuration, because on a 6-core CPU alone (GPU disabled) I get 13 t/s.

Update on Qwen 3.5 35B A3B on Raspberry PI 5

Posted by jslominski@reddit | LocalLLaMA | View on Reddit | 37 comments

PaMRxR@reddit

With the 2.9bpw I'm getting 5.5-6 tok/s, is it similar for you? I wonder how they reached 8 tok/s. Maybe the llama-server from < 18 February was faster, or I need to add some cooling :)

Qwen3.5 2b, 4b and 9b tested on Raspberry Pi5

Posted by jslominski@reddit | LocalLLaMA | View on Reddit | 36 comments

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

PaMRxR@reddit

I've been trying to run kld-sweep myself and have a short suggestion for improvement. In addition to --args it could support --args-quants. For running the full bf16 model I find that I need different parameters than for the quants, as it doesn't fit in VRAM for me.

Update on Qwen 3.5 35B A3B on Raspberry PI 5

Posted by jslominski@reddit | LocalLLaMA | View on Reddit | 37 comments

PaMRxR@reddit

Have you seen ByteShape's work? Most recently they [reported](https://byteshape.com/blogs/Devstral-Small-2-24B-Instruct-2512/) up to 9 tps for Qwen3-Coder-30B-A3B on Pi 5. Unfortunately they haven't released anything for Qwen3.5 yet.

What tokens/sec do you get when running Qwen 3.5 27B?

Posted by thegr8anand@reddit | LocalLLaMA | View on Reddit | 194 comments

PaMRxR@reddit

Thanks for sharing, I see similar performance bump which correlates roughly with the ~10% smaller size. According to KLD benchmarks the IQ4_XS is even better (and actually really good for its size), but both are proportionally worse than bigger quants. Anyway I'll give it a try for a few days! https://old.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

PaMRxR@reddit

A few GB go to RAM indeed, but it doesn't have a big effect on the speed. I avoid kv cache quantization unless really necessary.

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

PaMRxR@reddit

Have you noticed that enabling Native dramatically changes the amount of thinking this model does? It goes from about a minute in general to a few seconds for me. I can't figure out the reason for this, maybe the system prompt changes as Open WebUI sends the tool documentation?

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

PaMRxR@reddit

I run Q4_K_M (AesSedai) with these settings:

    --ctx-size 150000 --n-gpu-layers all --fit-target 256 --fit on -ncmoe 4 --swa-full -fa on

Getting 2510 pp and 82 tg on a 3090.

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

PaMRxR@reddit

I made a bit different plot of the first table showing quantization size vs. KLD. Quantizations under or close to the best fit line should be preferable I suppose. https://preview.redd.it/eh3fdawsnymg1.png?width=1000&format=png&auto=webp&s=39c7febfc9f9193c3d1629889c3361e4352bc5d4

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

PaMRxR@reddit

Do different sampling parameters (temp, top-p/min-p) have an effect on these benchmarks? It would be great if you also published the parameters you used.

Qwen 3.5 Plus(397b-a17b) is now available on Chinese Qwen APP

Posted by AaronFeng47@reddit | LocalLLaMA | View on Reddit | 22 comments

local vibe coding

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 146 comments

PaMRxR@reddit

Similarly using Devstral Small 2 Q4 locally on an RTX 3090 with 200k context. It's really snappy. Also experimenting with Qwen3-Coder-Next which feels quite smarter, but needs more than 32 GB RAM (in addition to 24 GB VRAM) to be usable at Q4. Still looking for the right agent tool. Of the ones I tried so far, Mistral Vibe has been my favorite.

Coding agent for local LLMs?

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 18 comments

PaMRxR@reddit (OP)

I'm aware of pi, a minor problem is it doesn't support MCP currently. Linux and llama-server is exactly what I run btw. But I just came across [oh-my-pi](https://github.com/can1357/oh-my-pi) which looks like a seriously upgraded fork of pi, worth a try as well!

Coding agent for local LLMs?

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 18 comments

PaMRxR@reddit (OP)

Aider development seems to have kinda stalled unfortunately, with the last release in Aug 2025. OpenCode is ok, I just wish it was configurable with regards to the system prompt. Disabling tools seems to still add their documentation into the system message.

Qwen/Qwen3-Coder-Next · Hugging Face

Posted by coder543@reddit | LocalLLaMA | View on Reddit | 248 comments

PaMRxR@reddit

Enabling OpenBLAS when building llama.cpp seems to bring another ~5-10%:

    prompt eval time = 12353.00 ms / 14302 tokens ( 0.86 ms per token, 1157.78 tokens per second)
    eval time = 59318.07 ms / 1957 tokens ( 30.31 ms per token, 32.99 tokens per second)
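
For comparing runs like this one, the pp and tg rates can be recomputed straight from the timing lines. A minimal sketch: the regular expression is written against the two lines quoted above and is an assumption about the log format, not a parser shipped with llama.cpp.

```python
import re

LOG = """\
prompt eval time = 12353.00 ms / 14302 tokens ( 0.86 ms per token, 1157.78 tokens per second)
eval time = 59318.07 ms / 1957 tokens ( 30.31 ms per token, 32.99 tokens per second)
"""

# Captures the line label, the elapsed milliseconds and the token count.
PATTERN = re.compile(
    r"^(?P<name>(?:prompt )?eval time)\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens",
    re.MULTILINE,
)


def throughput(log: str) -> dict[str, float]:
    """Return tokens per second for each timing line found in the log."""
    rates = {}
    for match in PATTERN.finditer(log):
        seconds = float(match["ms"]) / 1000.0
        rates[match["name"]] = int(match["tokens"]) / seconds
    return rates


rates = throughput(LOG)
print(f"pp: {rates['prompt eval time']:.2f} t/s")
print(f"tg: {rates['eval time']:.2f} t/s")
```

Running the same function over a log from a build without OpenBLAS would give the baseline to compare against.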