New_Spray_7886

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | 40 comments

What is the smallest amount of RAM sufficient to run any available on HF GGUF LLM model locally?

Posted by alex20_202020@reddit | LocalLLaMA | 36 comments

Using Intel Arc Pro series, any thoughts ?

Posted by BikerBoyRoy123@reddit | LocalLLaMA | 32 comments

If human brains are equivalent to 100T param LLMs and current SOTA local models are 1-2T params (basically cat brains) are we going to hit an intelligence wall for local models soon?

Posted by Porespellar@reddit | LocalLLaMA | 38 comments

AMD Hipfire - a new inference engine optimized for AMD GPU's

Posted by Thrumpwart@reddit | LocalLLaMA | 87 comments

New_Spray_7886@reddit

Noob question: where is the list of currently supported architectures? I’ve looked around the docs on the GitHub but am not finding it. Curious about gfx1030.

FINAL-Bench/Darwin-36B-Opus · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | 21 comments

New_Spray_7886@reddit

“But it wastes everyone’s fucking time. It pollutes leaderboards ppl actually look at.” I downloaded a couple of quants of the 3.5 version of this. It was one of the worst varieties of qwen-3.5 I tried: completely unusable, and it looped endlessly. I usually put the models I’ll seldom use in long-term storage on my NAS, considering the bandwidth they take to download. This one I just deleted since it was a complete waste.

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part.

Posted by marlang@reddit | LocalLLaMA | 152 comments

New_Spray_7886@reddit

Here are two settings.

1. I use this when I want more RAM available to use the computer at the same time (e.g. web browsing); it is like OP's. Qwen3.6-35B-A3B-IQ4\_XS (Bartowski) is 24-25 t/s here @ no context, 22 t/s @ 20% context (50k or so). Q4\_K\_M is a little slower at 20.5 t/s @ 20% context. I have many quants left to try but I like IQ4\_XS so far.

        export HSA_OVERRIDE_GFX_VERSION=10.3.0
        llama-server \
          --model /path/to/Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
          --jinja \
          -c 0 \
          -ngl 99 \
          --no-mmap \
          --cpu-moe \
          --n-cpu-moe 186 \
          --min-p 0.0 \
          --top-p 0.95 \
          --top-k 20 \
          --temp 0.6 \
          --parallel 1 \
          --chat-template-kwargs '{"preserve_thinking":true}' \
          --host 127.0.0.1 \
          --port 8033 2>&1 | tee /path/to/log.txt

2. I use this if I'm not also using the computer; maybe this will be agents running overnight soon. llama.cpp maximizes the performance by doing the fitting for you, so it is easier than testing how many layers you can offload. \~27.4 tok/s at no context as above.

        export HSA_OVERRIDE_GFX_VERSION=10.3.0
        llama-server \
          --model /path/to/Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
          --jinja \
          -c 0 \
          --no-mmap \
          --fit on \
          --min-p 0.0 \
          --top-p 0.95 \
          --top-k 20 \
          --temp 0.6 \
          --parallel 1 \
          --chat-template-kwargs '{"preserve_thinking":true}' \
          --host 127.0.0.1 \
          --port 8033 2>&1 | tee /path/to/log.txt

The logging is nice: if it runs slower than you want, you can just ask the LLM to calculate how many --n-cpu-moe layers you should offload by uploading that file and your server start-up command. I tested smaller context sizes and the speeds were only minimally different on my setup, so I'm keeping the max currently.

llama.cpp compiled with rocm, wmma, amdgpu_targets_gfx1030, and amdgpu_targets_gfx1031

    OS: Gentoo Linux x86_64
    Host: Z490 UD AC
    Kernel: Linux 6.6.35-gentoo-dist
    DE: Cinnamon 6.4.13
    CPU: Intel(R) Core(TM) i7-10700K (16) @ 5.10 GHz
    GPU: AMD Radeon RX 6700 XT [Discrete]
    Memory: 22.44 GiB / 31.27 GiB (72%)
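
If you would rather skim the `tee` log yourself before handing it to an LLM, something like the sketch below might help. It is only an illustration: `extract_tps` is a hypothetical helper name, and it assumes the log contains timing lines ending in "N tokens per second", which may not match every llama.cpp version.

```shell
#!/bin/sh
# Hypothetical helper (not part of llama.cpp): print every
# "tokens per second" figure found in a log file, one per line.
# Assumes timing lines shaped roughly like:
#   eval time = 1000.00 ms / 25 tokens ( 40.00 ms per token, 25.00 tokens per second)
# The exact wording is an assumption and may differ between builds.
extract_tps() {
    # -o prints only the matching part; awk keeps just the number.
    grep -oE '[0-9]+(\.[0-9]+)? tokens per second' "$1" | awk '{ print $1 }'
}
```

Usage would be `extract_tps /path/to/log.txt`, using the same log path passed to `tee` in the commands above. If the pattern does not match your build's output, adjust the regular expression to whatever your log actually prints.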

New_Spray_7886@reddit

I get 25 t/s with a 6700 XT + 32 GB RAM when setting aside VRAM for full context (prefill is ~300 t/s), so you should get workably higher speeds than that. This qwen-3.6 MoE is quite a bit more performant than even gemma-4 on these older consumer setups.