PaMRxR

Can't get over 250TPS on RTX5090 with Qwen3.5-4B

Posted by luckyj@reddit | LocalLLaMA | View on Reddit | 30 comments

[-]

PaMRxR@reddit

Maybe this post will be interesting for you. It's for datacenter GPUs but it goes into a lot of details and I found it generally educational. https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

You could get some small old GPU just to connect your monitor to it, that is one easy and cheap way to solve this.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

> My main goal is to offload as much as possible to the CPU/DDR. So, no hardware acceleration where ever possible (all Foot terminals combined use 10MB -20MB). I'm still tweaking. But you might want to consider going a similar route, as every bit of additional VRAM is vital! > My main goal is to offload as much as possible to the CPU/DDR. The GUIs on my Ubuntu 24.04 were taking up a few 100 MB which I didn't think is worth bothering to optimize.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

Indeed sounds extremely similar hw-wise. Meanwhile I moved the 3090s to an older dual Xeon server I already had, with 2x PCIe 3.0 x16 slots running headless. This cuts the max bandwidth in half, but also bumps the min bandwidth by 4x. In the end it's a bit faster for -sm tensor with llama.cpp, nothing dramatic though.

I'm done with using local LLMs for coding

Posted by dtdisapointingresult@reddit | LocalLLaMA | View on Reddit | 810 comments

[-]

PaMRxR@reddit

Local models require significant time investment to learn a lot of details of how things work and how to efficiently make use of the hardware and model capabilities. Without some curiosity driving you into this people like the OP will fail. People that just want to use something and don't really care about the details.

I'm done with using local LLMs for coding

Posted by dtdisapointingresult@reddit | LocalLLaMA | View on Reddit | 810 comments

[-]

PaMRxR@reddit

Qwen3.6-27B Q8 and KV-cache BF16 is working very well for me with the pi-coding-agent on 2x3090 GPUs. But even with 1x3090 before I've had pretty good success with: Qwen3.5-27B, and before Devstral-Small-2 24B and Qwen3-Coder-Next, before Qwen3-32B, and so on. Maybe I just haven't been spoiled by the cloud models? I've only ever tried Kimi (2.5 I think) with a 1 week free trial. My local models occasionally stumble due to lack of some obscure knowledge, but pasting some doc into the context is really not that hard.

Every time a new model comes out, the old one is obsolete of course

Posted by FullChampionship7564@reddit | LocalLLaMA | View on Reddit | 198 comments

[-]

PaMRxR@reddit

So essentially just take and contribute nothing back.

Qwen 3.6 35B crushes Gemma 4 26B on my tests

Posted by Lowkey_LokiSN@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

PaMRxR@reddit

In your llama-swap config add something like this after the cmd for example. Then "<model-id>:instruct" will also be available without any swapping. filters: setParams: temperature: 0.6 top_p: 0.95 top_k: 20 min_p: 0.0 presence_penalty: 0.0 repetition_penalty: 1.0 setParamsByID: "${MODEL_ID}:instruct": temperature: 0.7 top_p: 0.8 top_k: 20 min_p: 0.0 repetition_penalty: 1.0 chat_template_kwargs: enable_thinking: false

Guys we have to change the pelican test

Posted by Tall-Ad-7742@reddit | LocalLLaMA | View on Reddit | 93 comments

[-]

PaMRxR@reddit

2 tries with Qwen3.5-35B-A3B, no amount of prompting can get it to make something coherent :| https://preview.redd.it/eyx1utlklevg1.png?width=782&format=png&auto=webp&s=3709b65de66e8b30e425129133cc99bcd70ea94f

Guys we have to change the pelican test

Posted by Tall-Ad-7742@reddit | LocalLLaMA | View on Reddit | 93 comments

[-]

PaMRxR@reddit

Qwen3.5-27B Q8 below. https://preview.redd.it/sqwi91tekevg1.png?width=728&format=png&auto=webp&s=b9617b4ae81668e81dd49a5c5d99b70577351b66

Updated Qwen3.5-9B Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 106 comments

[-]

PaMRxR@reddit

> mradermacher's i1 quants are punching way above their weight I really wonder what they are doing! So far I was convinced byteshape are unbeatable per-byte, but it doesn't quite seem so with these results.

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Posted by PerceptionGrouchy187@reddit | LocalLLaMA | View on Reddit | 117 comments

[-]

PaMRxR@reddit

Prompt eval is much slower with -sm graph though, or?

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Posted by PerceptionGrouchy187@reddit | LocalLLaMA | View on Reddit | 117 comments

[-]

PaMRxR@reddit

Maybe it counts context as well, that's why it shows larger?

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Posted by PerceptionGrouchy187@reddit | LocalLLaMA | View on Reddit | 117 comments

[-]

PaMRxR@reddit

With multi-gpu definitely add an option like this: --device-draft CUDA0, otherwise it was pretty much same as baseline for me. With that tg went from 23 -> 36 for me with [IQ2_M](https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/blob/main/google_gemma-4-E2B-it-IQ2_M.gguf) (and 34 with UD-IQ2_M)

Weekend project with Intel B70s

Posted by dev_is_active@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

PaMRxR@reddit

Yeah that looks unusual, like they are blowing into each other.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

For which one do you refer? I run models at most at ~Q8_0 quantization.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

I just rebuilt it to try -sm tensor, but it keeps crashing as soon as it gets done with prompt processing unfortunately. Probably needs some time for issues to be ironed out.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

I meant slower compared to something like tensor parallel which utilizes the GPUs much better. Otherwise -sm layer is slower in comparison to one 3090 mainly if you load a larger quant I'd guess.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

So the bandwidth affects pp too, although a little less than tg (17% vs 26%). I just tried ik_llama with -sm graph and getting 35.4 tg, pretty much same as you! But pp tanks from 1950 to 770 (-b/-ub 512), or 870 (-b/-ub 4096) for me. Do you have the startup command? Maybe I'm missing/misusing some parameter. Here's what I used: ${ik-llama-server} --parallel 1 --model "${models_path}/Qwen3.5-27B-GGUF/Qwen3.5-27B-Q8_0-11.gguf" --mmproj "${models_path}/Qwen3.5-27B-GGUF/mmproj/mmproj-BF16.gguf" --chat-template-file "${models_path}/chat_templates/Qwen3.5.txt" --split-mode graph --fit --fit-margin 512 --seed 42 --ctx-size 100000 --n-gpu-layers 999 --jinja --peg -fa on --no-context-shift -b 2048 -ub 2048 --cache-ram 30000

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

Do you run llama.cpp or vLLM? Could you also share what PCIe configuration you have?

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

It's difficult to find quantizations that are well optimized for 2x3090s (48GB VRAM + Q8/BF16 native support). I really think more people can benefit by tuning models specifically for their systems. With --split-mode layer the slow transfer doesn't really matter as far as I know, but the GPUs are only utilized like 50% so I think tg is at least 2x slower than it could be.

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

Thanks for sharing mate, it looks very similar to my numbers. My 27B quant is 31GB and I can fit 200k context. I don't power limit which maybe explains a little faster tg of 25. Have you tried --split-mode row in llama.cpp, or maybe vLLM with tensor parallel=2?

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

Qwen3.5 122B does not fit fully of course, the quant I use is 72 GB, so ~36GB experts run in system RAM. I find it generally a bit dumber than Qwen 3.5 27B anyway, but occasionally it comes ahead in debugging tasks where the little more knowledge of more obscure details helps. Actually I run a VERY similar system of Qwen3.5 27B agent + Qwen 3.5 35B-A3B subagent. But I run each fully on both cards at a Q8_0+, swapping them back and forth with llama-swap. Have you tried such an arrangement? It's slower with the swapping, pp is seriously faster, and finally tg is slower due to larger weights I think. Whether it has a significant impact on quality is hard to tell though. Do you find the Gemma4 combo better than Qwen3.5 btw?

Dual 3090 setup - performance optimization

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

PaMRxR@reddit (OP)

Just yesterday I tried actually compiling that P2P driver patch, and ran into various issues.. both compilation and linking errors. I don't expect miracles from it anyway, the PCIe 3 slot is physically max 4x - 4GB/s.

96GB Vram. What to run in 2026?

Posted by inthesearchof@reddit | LocalLLaMA | View on Reddit | 88 comments

[-]

PaMRxR@reddit

Give it a try with ik_llama.cpp, I use it specifically for Qwen3.5 122B-A10B with 2x24GB in VRAM + 35GB in RAM, which in scale I think is kinda similar to what you are trying. I'm getting 1000 pp/s, a lot better than llama.cpp.

ASUS X99-E WS with 2x 3090. Anyone was able to set it up?

Posted by novanet-central@reddit | LocalLLaMA | View on Reddit | 6 comments

[-]

PaMRxR@reddit

I'm curious what inference performance you manage to get with your setup. I just posted about my experience with 2x 3090, but worse motherboard.

Qwen3.5-35B-A3B Q4 Performance on Intel Arc B60?

Posted by LeDynamique@reddit | LocalLLaMA | View on Reddit | 5 comments

[-]

PaMRxR@reddit

Sounds like something is not right with the implementation or configuration, because on a 6-core CPU alone (GPU disabled) I get 13t/s..

Update on Qwen 3.5 35B A3B on Raspberry PI 5

Posted by jslominski@reddit | LocalLLaMA | View on Reddit | 37 comments

[-]

PaMRxR@reddit

With the 2.9bpw I'm getting 5.5-6 tok/s, is it similar for you? I wonder how they reached 8 tok/s. Maybe the llama-server from < 18 February was faster, or I need to add some cooling :)

Update on Qwen 3.5 35B A3B on Raspberry PI 5

Posted by jslominski@reddit | LocalLLaMA | View on Reddit | 37 comments

[-]

PaMRxR@reddit

Great I'm also just about to set it up, have the 2.9bpw downloaded.

Qwen3.5 2b, 4b and 9b tested on Raspberry Pi5

Posted by jslominski@reddit | LocalLLaMA | View on Reddit | 36 comments

[-]

PaMRxR@reddit

Fantastic, thanks for sharing the code and your experience!

Qwen3.5 2b, 4b and 9b tested on Raspberry Pi5

Posted by jslominski@reddit | LocalLLaMA | View on Reddit | 36 comments

[-]

PaMRxR@reddit

Do you have some cooling on the raspberry pi? I wonder if it can handle the inference workload without any cooling.

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

PaMRxR@reddit

Phew good thing I run Linux! Otherwise it would've been a pain as I connect remotely to my machine some 10km away.

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

PaMRxR@reddit

That was fast! Thanks for sharing this great work mate.

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

PaMRxR@reddit

I've been trying to run kld-sweep myself and have a short suggestion for improvement. In addition to --args it could suport --args-quants. For running the full bf16 model I find that I need different parameters as it doesn't fit in VRAM for me, compared to the quants.

Update on Qwen 3.5 35B A3B on Raspberry PI 5

Posted by jslominski@reddit | LocalLLaMA | View on Reddit | 37 comments

[-]

PaMRxR@reddit

Have you seen ByteShape's work? Latest they [reported](https://byteshape.com/blogs/Devstral-Small-2-24B-Instruct-2512/) up to 9tps for Qwen3-Coder-30B-A3B on Pi 5. Unfortunately they haven't released anything for Qwen3.5 yet.

What tokens/sec do you get when running Qwen 3.5 27B?

Posted by thegr8anand@reddit | LocalLLaMA | View on Reddit | 194 comments

[-]

PaMRxR@reddit

Thanks for sharing, I see similar performance bump which correlates roughly with the ~10% smaller size. According to KLD benchmarks the IQ4_XS is even better (and actually really good for its size), but both are proportionally worse than bigger quants. Anyway I'll give it a try for a few days! https://old.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

[-]

PaMRxR@reddit

Take it easy mate. I shared my experience, you shared.. something else, let's move on now.

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

[-]

PaMRxR@reddit

It's not "a feeling", I'm using Q4_K_M myself with 121k context on 24 GB VRAM (and just -ncmoe 4).

What tokens/sec do you get when running Qwen 3.5 27B?

Posted by thegr8anand@reddit | LocalLLaMA | View on Reddit | 194 comments

[-]

PaMRxR@reddit

RTX 3090, 32-34 t/s @ d40000 using 27B-UD-Q4_K_XL. Practically the same!

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

[-]

PaMRxR@reddit

A few GB go to RAM indeed, but it doesn't have a big effect on the speed. I avoid kv cache quantization unless really necessary.

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

[-]

PaMRxR@reddit

Have you noticed that enabling Native changes dramatically the amount of thinking this model does? It goes from a minute in general to a few seconds for me. I can't figure out what the reason for this is, maybe the system prompt changes as Open WebUI sends tool documentations?

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

[-]

PaMRxR@reddit

That sounds like some very outdated advice. Qwen3.5 is incredibly efficient with the context.

Open WebUI’s New Open Terminal + “Native” Tool Calling + Qwen3.5 35b = Holy Sh!t!!!

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 221 comments

[-]

PaMRxR@reddit

I run Q4_K_M (AesSedai) with these settings: --ctx-size 150000 --n-gpu-layers all --fit-target 256 --fit on -ncmoe 4 --swa-full -fa on Getting 2510 pp and 82 tg on a 3090.

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

PaMRxR@reddit

I made a bit different plot of the first table showing quantization size vs. KLD. Quantizations under or close to the best fit line should be preferable I suppose. https://preview.redd.it/eh3fdawsnymg1.png?width=1000&format=png&auto=webp&s=39c7febfc9f9193c3d1629889c3361e4352bc5d4

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments

[-]

PaMRxR@reddit

Do different sampling parameters (temp, top-p/min-p) have an effect on these benchmarks? Would be great if you published also the parameters you used.

Qwen 3.5 Plus(397b-a17b) is now available on Chinese Qwen APP

Posted by AaronFeng47@reddit | LocalLLaMA | View on Reddit | 22 comments

[-]

PaMRxR@reddit

You have to turn on Thinking for "tricky" questions like this.

local vibe coding

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 146 comments

[-]

PaMRxR@reddit

Similarly using Devstral Small 2 Q4 locally on an RTX 3090 with 200k context. It's really snappy. Also experimenting with Qwen3-Coder-Next which feels quite smarter, but needs more than 32 GB RAM (in addition to 24 GB VRAM) to be usable at Q4. Still looking for the right agent tool. Of the ones I tried so far, Mistral Vibe has been my favorite.

Coding agent for local LLMs?

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 18 comments

[-]

PaMRxR@reddit (OP)

I'm aware of pi, a minor problem is it doesn't support MCP currently. Linux and llama-server is exactly what I run btw. But I just came across [oh-my-pi](https://github.com/can1357/oh-my-pi) which looks like a seriously upgraded fork of pi, worth a try as well!

Coding agent for local LLMs?

Posted by PaMRxR@reddit | LocalLLaMA | View on Reddit | 18 comments

[-]

PaMRxR@reddit (OP)

Aider development seems to have kinda stalled unfortunately, with the last release in Aug 2025. OpenCode is ok, I just wish it was configurable with regards to the system prompt. Disabling tools seems to still add their documentation into the system message.

Qwen/Qwen3-Coder-Next · Hugging Face

Posted by coder543@reddit | LocalLLaMA | View on Reddit | 248 comments

[-]

PaMRxR@reddit

Enabling OpenBLAS when building llama.cpp seems to bring another ~5-10%: prompt eval time = 12353.00 ms / 14302 tokens ( 0.86 ms per token, 1157.78 tokens per second) eval time = 59318.07 ms / 1957 tokens ( 30.31 ms per token, 32.99 tokens per second)