TheaterFire

Qwen3-30B-A3B runs at 12-15 tokens-per-second on CPU

Posted by AlgorithmicKing@reddit | LocalLLaMA | View on Reddit | 227 comments

CPU: AMD Ryzen 9 7950X3D
RAM: 32 GB

I am using the Unsloth Q6\_K version of Qwen3-30B-A3B ([Qwen3-30B-A3B-Q6\_K.gguf · unsloth/Qwen3-30B-A3B-GGUF at main](https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/Qwen3-30B-A3B-Q6_K.gguf))

227 Comments

DrVonSinistro@reddit

235B-A22B Q4 runs at 2.39 t/s on an old server with quad-channel DDR4. (5080 tokens generated)
View on Reddit #54963614

plopperzzz@reddit

Yeah, I have one with dual xeon E5-2697A V4, 160GB of RAM, a Tesla M40 24GB, and a Quadro M4000. The entire thing cost me around $700 CAD, and mostly for the RAM and M40, and i get 3 t/s. However, from what i am hearing about Qwen3 30B A3B, I doubt i will keep running the 235B.
View on Reddit #54978289

---j0k3r---@reddit

do you have any specific settings? i have a somewhat similar setup (with an M60), and while it runs and is somewhat usable, i have to wait almost 50 mins for it to start "thinking", and from btop it seems to be using more cpu than gpu :-(
View on Reddit #65541532

Klutzy_Can_5909@reddit

Tesla M40 is way too slow, it has only 288GB/s bandwidth and 6TFlops, try to get a Volta/Turing GPU with Tensor cores. I'm not sure what you can get in your local market. I recently bought an AMD MI50 32G (no tensor cores but HBM2 memory) for only $150. And there are other options like V100 sxm2 16G (with a sxm2 to pcie card) and 2080Ti 11/22G
View on Reddit #55579121

Jimster480@reddit

Yes, but at what context size, and what are you actually feeding it? Because I can tell you that at 10k context, for example, the AI (Qwen3 14b) will slow down to around 5 tokens a second on a Threadripper 3960X with partial GPU acceleration through Vulkan.
View on Reddit #56631069

DrVonSinistro@reddit

tests were done with context set to 32k and I sent a 15k prompt to refactor some code. I have 60GB offloaded to 3 cuda GPUs.
View on Reddit #56636842

Jimster480@reddit

Which GPUs are you using?
View on Reddit #56842659

a_beautiful_rhind@reddit

Dense 70b runs about that fast on dual socket xeon with 2400MT/s memory. Since quants appear fixed, eager to see what happens once I download. If that's the kind of speeds I get along with GPUs then these large MoE being a meme is fully confirmed.
View on Reddit #54985957

Dyonizius@reddit

> dual

that's lga2011 right? do you use copies=2 or some other trick? are layers crossing the interlink?
View on Reddit #56244934

a_beautiful_rhind@reddit

LGA 3647. for llama.cpp I put --numa distribute
View on Reddit #56245068

Dyonizius@reddit

so when i set --numa distribute the model loads very slowly like 200mb/s which is strange since QPI link should be at least 16-32GB/s, I'll end up putting denser ram sticks and running single node... what kind of performance you get on the 30B moe?
View on Reddit #56265557

a_beautiful_rhind@reddit

I did deepseek v2.5 and the 235b only. For the 30b I could run the whole thing on GPU at full precision. Didn't bother with it beyond testing on OR.
View on Reddit #56265789

Dyonizius@reddit

i guess you get the same speed as running single node except with more ram right?
View on Reddit #56265928

a_beautiful_rhind@reddit

More. I tried putting it on one with isolate instead of distribute and it was slower.
View on Reddit #56268541

Dyonizius@reddit

so 6-channels... I guess you're getting the same results as single node except with more ram? what kind of performance you get on 30b moe?
View on Reddit #56265788

Willing_Landscape_61@reddit

How does it compare, speed and quality, with a Q2 of DeepSeek v3 on your server?
View on Reddit #54978297

MR_-_501@reddit

What specs?
View on Reddit #54970857

Dangerous_Bunch_3669@reddit

I got 20-30 TPS with Snapdragon X Elite laptop. Lenovo Yoga Slim 7x 32gb ram. Pretty incredible model, and the fact I can run it on my tiny laptop is freaking crazy.
View on Reddit #63094599

Boricua-vet@reddit

https://preview.redd.it/4b82mxivag6f1.png?width=1592&format=png&auto=webp&s=852ce754945d1ab9c17fbe32a7dd24dc4cfbc81c I am getting an average of 40 TPS on dual P102-100 in Ollama. I cannot believe the performance on my 70 dollar investment for two of these cards.
View on Reddit #58772147

Boricua-vet@reddit

https://preview.redd.it/p5e2hj1zcg6f1.png?width=1144&format=png&auto=webp&s=ee75fd666cfc13448aff52a53236af711a9439cb 44 TPS using llama.cpp, on the same two P102-100.
View on Reddit #58772230

emaiksiaime@reddit

What backend? ollama only serves q4, have you set up vLLM or llama.cpp? what is your setup?
View on Reddit #56780285

AlgorithmicKing@reddit (OP)

i provided the link in the post, ollama can pull ggufs from hugging face, and in the ollama model registry, if you press the view all models button, you can see more quants.
View on Reddit #56793290

emaiksiaime@reddit

Thanks, never noticed that before! Q4 to Q8 is a big jump, wish they would put the q6 quant on ollama. I might try the gguf from hf but I am not too sure about setting up modelfiles for ggufs
View on Reddit #56971446
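On the Modelfile worry: as far as I know, Ollama accepts a `hf.co/<user>/<repo>:<quant>` reference directly (the same form another commenter in this thread uses), so a Modelfile should not be needed for the basic case. A minimal sketch that only builds the command; the helper name `ollama_hf_command` is made up for illustration, and the repo and quant tag are the ones from the post:

```python
import shlex


def ollama_hf_command(repo: str, quant: str) -> list[str]:
    """Build an `ollama run` command for a GGUF hosted on Hugging Face.

    Hypothetical helper: it only assembles the argument list and does not
    check that the repo or the quant tag actually exists.
    """
    return ["ollama", "run", f"hf.co/{repo}:{quant}"]


cmd = ollama_hf_command("unsloth/Qwen3-30B-A3B-GGUF", "Q6_K")
print(shlex.join(cmd))
```

Copy the printed line into a terminal to try it; whether a given quant tag is available depends on what the repo actually ships.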

ForsookComparison@reddit

Kinda confused. Two Rx 6800's and I'm only getting 40 tokens/second on Q4 :'(
View on Reddit #54963148

sumrix@reddit

34 tokens/second on my 7900 XTX via ollama
View on Reddit #54968768

ForsookComparison@reddit

That doesn't sound right 🤔
View on Reddit #54988341

sumrix@reddit

LLM backends are so confusing sometimes. QwQ runs at the same speed. But some smaller models much slower.
View on Reddit #54991505

Jimster480@reddit

Well this is because LM studio just reports generation speed and nothing else.
View on Reddit #56633197

Jimster480@reddit

Which tokens are you referring to? Generation speed or what? Since 36tk/s is generation speed.
View on Reddit #56633107

MaruluVR@reddit

There are people reporting getting higher speeds after switching away from ollama.
View on Reddit #54994811

HilLiedTroopsDied@reddit

4090 with all layers offloaded to gpu, 117tk/s, offload 36/48 which will hit cpu (9800x3d + pc6200 cas30) does 34tk/s
View on Reddit #54994856

Deep-Technician-8568@reddit

I'm only getting 36 tk/s with 4060 ti and 5060 ti.
View on Reddit #54970243

zachsandberg@reddit

I'm getting ~8 t/s with qwen3:235b-a22b on CPU only. The 30B-A3B model about 30 t/s!
View on Reddit #55020906

Radiant_Hair_2739@reddit

Hello, what CPU are you using? On my dual Xeon 2699v4 with 256gb RAM, I'm getting about 10 t/s on the 30B-A3B model and 2.5 t/s on the 235b model.
View on Reddit #55567860

zachsandberg@reddit

Hello, I have a single Xeon 6526Y and 512GB of DDR5.
View on Reddit #55598100

Jimster480@reddit

Six tokens a second generation speed? And if so, at what context size?
View on Reddit #56631689

Wonderful_Ebb3483@reddit

Tested today on my macbook pro with m4 pro cpu and 48 GB RAM and using mlx 4-bit quant. The results are 70 tokens/second and they are really good. Future is open source
View on Reddit #54991277

Jimster480@reddit

What size context are you running?
View on Reddit #56631613

pkmxtw@reddit

15 t/s tg speed should be achievable by most dual-channel DDR5 setups, which is very common for current-gen laptop/desktops. Truly an o3-mini level model at home.
View on Reddit #54962149
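A rough way to sanity-check that claim: if token generation is limited by memory bandwidth, a ceiling is bandwidth divided by the bytes of weights read per token. The sketch below is a back-of-envelope estimate only. The 90 GB/s figure (roughly dual-channel DDR5-5600) and the bits-per-weight values are assumptions, and it ignores KV-cache reads, shared layers and real-world bandwidth efficiency, so measured speeds will land well below it.

```python
def max_tokens_per_s(bandwidth_gb_s: float, active_params_billion: float,
                     bits_per_weight: float) -> float:
    """Upper bound on generation speed if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


ASSUMED_BANDWIDTH_GB_S = 90.0  # assumption: about dual-channel DDR5-5600
ACTIVE_PARAMS_B = 3.0          # the "A3B" in the model name

# Bits-per-weight values are rough assumptions for each quant family.
for label, bpw in [("~Q4", 4.5), ("~Q6", 6.5), ("~Q8", 8.5)]:
    ceiling = max_tokens_per_s(ASSUMED_BANDWIDTH_GB_S, ACTIVE_PARAMS_B, bpw)
    print(f"{label}: at most {ceiling:.1f} t/s")
```

Under these assumptions the ceiling is a few tens of tokens per second, which makes 12-20 t/s in practice look plausible rather than surprising.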

nebenbaum@reddit

Yeah. I just tried it myself. Stuff like this is a game-changer, not some huge ass new frontier models. This runs on my i7 ultra 155 with 32GB of ram (latitude 5450) at around that speed at q4. No special GPU. No Internet necessary. Nothing. Offline and on a normal 'business laptop'. It actually produces very usable code, even in C. I might actually switch over to using that for a lot of my 'ai assisted coding'.
View on Reddit #54975093

whitemankpi@reddit

Could you briefly describe the installation process? 
View on Reddit #56482751

Jimster480@reddit

Basically, you just install LM Studio or MSTY.
View on Reddit #56630941

maikuthe1@reddit

Is it really o3-mini level? I saw the benchmarks but I haven't tried it yet.
View on Reddit #54962438

numsu@reddit

It went into an infinite thinking loop on my first prompt asking it to describe what a block of code does. So no. Not o3-mini level.
View on Reddit #54984822

toothpastespiders@reddit

Yet another person chiming in that I had the same problem at first. The issue for me wasn't just the samplers. I also needed to change the prompt format to 'exactly' match the examples. I think there might have been an extra line break or something compared to standard chatml. I had the issue with this model and the 8b. Fixed it for me with this one, but I haven't tried with 8b again.
View on Reddit #55034084

Tactful-Fellow@reddit

I had the same experience out of the box; tuning it to the recommended settings immediately fixed the problem.
View on Reddit #55000465

Thomas-Lore@reddit

Wrong settings most likely, follow the recommended ones.
View on Reddit #54988099

Historical-Yard-2378@reddit

As they say in spain: no.
View on Reddit #54962620

_w_8@reddit

they don't even have electricity there
View on Reddit #54967944

economic-salami@reddit

Brutal
View on Reddit #54977174

dankhorse25@reddit

¿?
View on Reddit #54999737

thebadslime@reddit

At some tasks? yes. Coding isn't one of them
View on Reddit #54964040

sundar1213@reddit

Can you please elaborate on what kind of tasks this is useful?
View on Reddit #54973659

RMCPhoto@reddit

In the best cases it probably performs as well as a very good 14B across the board. The older calculation would say 30/3=10b equivalent, but hopefully there have been some moe advancements and improvements to the model itself.
View on Reddit #54979556

pkmxtw@reddit

If you believe their benchmark numbers, yes. Although I would be surprised that it is actually o3-mini level.
View on Reddit #54963655

maikuthe1@reddit

That's why I was asking, I thought maybe you had tried it. Guess we'll find out soon.
View on Reddit #54963878

SkyFeistyLlama8@reddit

I'm getting 18-20 t/s for inference or TG on a Snapdragon X Elite laptop with 8333 MT/s (135 GB/s) RAM. An Apple Silicon M4 Pro chip would get 2x that, a Max chip 4x that. Sweet times for non-GPU users. The thinking part goes on for a while but the results are worth the wait.
View on Reddit #54965539

Simple_Split5074@reddit

I tried it on my SD 8 elite today, quite usable in ollama out of the box, yes.
View on Reddit #54968802

SkyFeistyLlama8@reddit

What numbers are you seeing? I don't know how much RAM bandwidth mobile versions of the X chips get.
View on Reddit #54973913

Simple_Split5074@reddit

Stupid me, SD X elite of course
View on Reddit #55022257

UncleVladi@reddit

there is rog phone 9 and redmagic with 24gb, but i cant find the memory bandwidth for them
View on Reddit #55031802

Simple_Split5074@reddit

Sorry I am an idiot, it's an SD elite of course 😔
View on Reddit #55022194

rorowhat@reddit

Is it running on the NPU?
View on Reddit #54981072

Simple_Split5074@reddit

Don't think so. Once the dust settles I will look into that
View on Reddit #55022086

pkmxtw@reddit

I'm only getting 60 t/s on M1 Ultra (800 GB/s) for Qwen3 30B-A3B Q8_0 with llama.cpp, which seems quite low. For reference, I get about 20-30 t/s on dense Qwen2.5 32B Q8_0 with speculative decoding.
View on Reddit #54966075

MoffKalast@reddit

Well then add Qwen3 0.6B for speculative decoding for apples to apples on your Apple.
View on Reddit #54982657

pkmxtw@reddit

I will see how the 0.6B will help with speculative decoding with A3B.
View on Reddit #55013390

SkyFeistyLlama8@reddit

It's because of the weird architecture on the Ultra chips. They're two joined Max dies, pretty much, so you won't get 800 GB/s for most workloads.
View on Reddit #54966434

pkmxtw@reddit

I was using Qwen2.5 0.5B/1.5B as the draft model for 32B, which can give up to 50% speed up on some coding tasks.
View on Reddit #54967396

SkyFeistyLlama8@reddit

I'm surprised a model from the previous version works. I guess the tokenizer dictionary is the same.
View on Reddit #54972376

pkmxtw@reddit

No, I meant using Qwen 2.5 32B with Qwen 2.5 0.5B as draft model. Haven't had time to play with the Qwen 3 32B yet.
View on Reddit #55012278

mycall@reddit

I wish they made language specific models (Java, C, Dart, etc) for these small models.
View on Reddit #54972323

sage-longhorn@reddit

Fine tune one and share it!
View on Reddit #54988053

Secure_Reflection409@reddit

Yeh, this feels like a mini breakthrough of sorts.
View on Reddit #54968886

x2P@reddit

I get 18tps with a 9950x and dual channel ddr5 6400 ram
View on Reddit #55021782

dankhorse25@reddit

Question. Would going to quad channel help? It's not like it would be that hard to implement. Or even octa channel?
View on Reddit #54999640

pkmxtw@reddit

Yes, but both Intel/AMD use the number of memory channels to segregate their products, so you aren't going to get more than dual channel on consumer laptops. Also, more bandwidth won't help with the abysmal prompt processing speed on pure consumer CPU setups.
View on Reddit #55013724

IrisColt@reddit

In my use case (maths), GLM-4-32B-0414 nails more questions and is significantly faster than Qwen3-30B-A3B. 🤔
View on Reddit #54998808

rorowhat@reddit

With this big of a model?
View on Reddit #54980940

alchamest3@reddit

the dream is that it can run on my raspberry pi.
View on Reddit #54988041

shing3232@reddit

my 8845+4060 could do better with ktransformer lol
View on Reddit #54968104

kmouratidis@reddit

I got 25 t/s on low context for the q8 model.
View on Reddit #54965527

Science_Bitch_962@reddit

I'm sold. The fact that this model can run on my 4060 8GB laptop and get really really close ( or on par) quality with o1 is crazy.
View on Reddit #54962537

Secure-food4213@reddit

how much ram do you have? and does it run fine? unsloth said only Q6, Q8 or bf16 for now
View on Reddit #54968126

Science_Bitch_962@reddit

32gb DRAM and 8gb VRAM. Quality is quite good on Q4\_K\_M (lmstudio-community version), and I cant notice differences compared to Q6\_K (unsloth) for now. On Q6\_K unsloth I got 13-14 token/s. It's okay speed regarding the weak ryzen 7535HS
View on Reddit #54972072

Jimster480@reddit

What is your context size and how much are you filling it? Are you just doing random chat or are you asking complex questions?
View on Reddit #56630656

Secure-food4213@reddit

Nice
View on Reddit #54990731

AlgorithmicKing@reddit (OP)

is that username auto generated? (i know, completely off topic, but man, reddit auto generated usernames are hilarious)
View on Reddit #54965964

Science_Bitch_962@reddit

LOL it's not
View on Reddit #54972377

Hunting-Succcubus@reddit

do you like bitching
View on Reddit #54992029

ReasonablePossum_@reddit

What kind of humor is dat?
View on Reddit #55022917

Blinkinlincoln@reddit

It's part of the name, but the name is clearly a reference to JESSE PINKMAN. YO MR. WHITE!
View on Reddit #55053341

ReasonablePossum_@reddit

Damn that series is old af lol.
View on Reddit #55061582

ReasonablePossum_@reddit

Someone posted that u can offload to cpu and run q6
View on Reddit #55023570

Alex_1729@reddit

How is this model on par with o1? I'm looking at all benchmarks I know of and I'm not seeing anything out there. Plus it has a context window of 128k
View on Reddit #55010496

logseventyseven@reddit

are you running Q6?
View on Reddit #54962730

murlakatamenka@reddit

Usual diff between q6 and q8 is minuscule. But so is between q8 and unquantized f16. I would pick q6 all day long and rather fit more cache or layers on the GPU.
View on Reddit #54997400

Science_Bitch_962@reddit

Oh sorry, it's just Q4
View on Reddit #54962958

kmouratidis@reddit

I think unsloth mentioned something about only q6/q8 being recommended right now. May be worth looking into.
View on Reddit #54965574

YearZero@reddit

It looks like in unsloth's guide it says all quants are now fixed: [https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune) So if that's a reference to what you said, maybe it's resolved?
View on Reddit #54979498

kmouratidis@reddit

Yes, that was what I had seen. Edited my previous comment.
View on Reddit #54982792

Science_Bitch_962@reddit

Testing it rn, must be really specific usecase to see the differences.
View on Reddit #54972496

kmouratidis@reddit

Or it could be broken quantizations. It happens. There was a study that showed that a bad FP8 quant of Llama3-405B performed worse than a good GPTQ (w4a16) quant of Llama3-70B. Plus most quants don't run some extra stuff (adaptive/dynamic quantization, post-training) to recover performance.
View on Reddit #54973046

chawza@reddit

I have 16gb vram, can I run it?
View on Reddit #54973714

Korkin12@reddit

i run it on 3060 gaming -12gb, pretty slow but works
View on Reddit #55582958

Thomas-Lore@reddit

Why not? A lot of us run it without any VRAM. You may need to offload some to RAM to fit, but q3 or q4 should work fine.
View on Reddit #54988638

chawza@reddit

Yeah, but not a 33B model - _-. My cpu went wild running 7B models
View on Reddit #54991822

Admirable-Star7088@reddit

It would be awesome if MoE could be good enough to make GPU obsolete in favor of CPU for LLM inference. However, in my testing, 30b A3B is not quite as smart as 32b dense. On the other hand, Unsloth said many of the GGUFs of 30b A3B have bugs, so hopefully the worse quality is mostly because of the bugs and not because of it being a MoE.
View on Reddit #54967909

Klutzy_Can_5909@reddit

30B-A3B is supposed to be used as the Speculative Decoding model for 235B-A22B, to accelerate the larger model.
View on Reddit #55579461

OmarBessa@reddit

It's not supposed to be as smart as a 32B. It's supposed to be sqrt(params*active). Which gives us 9.48.
View on Reddit #54979427
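The rule of thumb quoted above is just the geometric mean of total and active parameters. A minimal sketch of the arithmetic follows; the function name is made up for illustration, and the rule itself is a heuristic from the comment, not an established law.

```python
import math


def moe_dense_equivalent(total_billion: float, active_billion: float) -> float:
    """Geometric mean of total and active parameter counts, in billions."""
    return math.sqrt(total_billion * active_billion)


# Nominal figures taken from the model names in this thread.
print(f"30B-A3B   -> ~{moe_dense_equivalent(30, 3):.2f}B dense-equivalent")
print(f"235B-A22B -> ~{moe_dense_equivalent(235, 22):.2f}B dense-equivalent")
```

For 30B-A3B this gives sqrt(90), about 9.49 (the 9.48 above is the same value truncated rather than rounded).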

mgoksu@reddit

Would you mind explaining the idea behind that calculation?
View on Reddit #55076383

OmarBessa@reddit

It's from this Stanford video at 52m. https://www.youtube.com/watch?v=RcJ1YXHLv5o
View on Reddit #55085000

mgoksu@reddit

Thanks!
View on Reddit #55177126

OmarBessa@reddit

You're welcome
View on Reddit #55177777

shroddy@reddit

How does it compare to 14b dense or 8b dense?
View on Reddit #55056221

uti24@reddit

> A3B is not quite as smart as 32b dense

I feel it's not even as smart as mistral small. I did some testing for coding, roleplay and general knowledge. I also hope there is some bug in unsloth quantization.
View on Reddit #54977688

a_beautiful_rhind@reddit

Fast shitty outputs are still shitty.
View on Reddit #54986059

AppearanceHeavy6724@reddit

It is about as smart as Gemma 3 12b. OTOH Qwen 3 8b with reasoning on generated better code than 30b.
View on Reddit #54978709

yoracale@reddit

It's now fixed!!! Please redownload them :)
View on Reddit #54984384

dankhorse25@reddit

Wow! If the big corpos think that the future is solely API driven models then they have to think again.
View on Reddit #54962459

redoubt515@reddit

The locally hostable models are virtually all made by big tech. It seems pretty clear that at least at this point big tech is not 100% all in on API only. The topic of this thread (Qwen) is made by one of China's largest companies (Alibaba). Llama, Gemma, Phi, are made by 3 of America's largest corporations (all 3 are currently much larger than any of the *API only* AI companies).
View on Reddit #55196210

uhuge@reddit

but now Olmo is not bad too and it's from a startup
View on Reddit #55474949

Ace2Face@reddit

I love the way you play, choom
View on Reddit #54971918

XPEZNAZ@reddit

I hope local llms continue growing and keeping up with the big corp llms. This HAS to be privatized.
View on Reddit #54963121

throw_1627@reddit

why stress your CPU unnecessarily lets heat up the corpos GPUs
View on Reddit #55294612

redoubt515@reddit

> I hope local llms continue growing

I hope so too. And I've been really impressed by the progress over the past couple years.

> ..and keeping up with the big corp llms.

Admittedly a little pedantic of me but the "Local LLMs" ***are*** the "big corp LLMs" at the moment:

* Qwen = Alibaba (one of the largest corporations in the world)
* Llama = Meta (one of the largest corporations in the world)
* Gemma = Google (one of the largest corporations in the world)
* Phi = Microsoft (one of the largest corporations in the world)

The two exceptions I can think of would be:

* Mistral (medium sized French startup)
* Deepseek (subsidiary of a Chinese Hedge Fund)
View on Reddit #55195229

CacheConqueror@reddit

Anyone tested it on Mac?
View on Reddit #54967227

_w_8@reddit

running in ollama with macbook m4 max + 128gb

[hf.co/unsloth/Qwen3-30B-A3B-GGUF:Q4\_K\_M](http://hf.co/unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M): 62t/s
View on Reddit #54968433

ffiw@reddit

similar spec, lm studio mlx q8, getting around 70t/s
View on Reddit #54985563

Wonderful_Ebb3483@reddit

Yep, same here 70t/s with m4 pro running through mlx 4-bit as I only have 48 GB RAM
View on Reddit #54991542

Zestyclose_Yak_3174@reddit

That speed is good, but I know that MLX 4-bit quants are usually not that good compared to GGUF files. What is your opinion on the quality of the output? I'm also VRAM limited
View on Reddit #55022426

Wonderful_Ebb3483@reddit

good for most of the things, it's not Gemini Pro 2.5 or o4 mini quality. I have some use cases for it, I will check gguf files, higher quants and unsloth version and compare. Thanks for the tip
View on Reddit #55253884

Recluse1729@reddit

Playing around with [hf.co/unsloth/Qwen3-30B-A3B-GGUF:Q6\_K](http://hf.co/unsloth/Qwen3-30B-A3B-GGUF:Q6_K) on my M4 Max with 48GB using:

```bash
llama-cli -m Qwen3-30B-A3B-Q6_K.gguf \
  -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
  -fa -ngl 99 -n 2096 -b 512 -c 32768 --tensor-split 0.5,0.5
```

and got:

```bash
llama_perf_sampler_print: sampling time = 90.86 ms / 743 runs ( 0.12 ms per token, 8177.60 tokens per second)
llama_perf_context_print: load time = 1865.83 ms
llama_perf_context_print: prompt eval time = 28270.11 ms / 232 tokens ( 121.85 ms per token, 8.21 tokens per second)
llama_perf_context_print: eval time = 65665.89 ms / 4726 runs ( 13.89 ms per token, 71.97 tokens per second)
llama_perf_context_print: total time = 141292.46 ms / 4958 tokens
```

Seems to use ~26GB of RAM with that.
View on Reddit #54999077
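Several replies in this thread ask whether a quoted number is prompt processing or generation speed. The two can be recomputed from the timing lines above by dividing token counts by elapsed time. A small sketch, assuming the log format shown in that comment (the regex is written against those two lines only and may need adjusting for other llama.cpp versions):

```python
import re

# The two timing lines from the comment above, copied verbatim.
LOG = """\
llama_perf_context_print: prompt eval time = 28270.11 ms / 232 tokens ( 121.85 ms per token, 8.21 tokens per second)
llama_perf_context_print: eval time = 65665.89 ms / 4726 runs ( 13.89 ms per token, 71.97 tokens per second)
"""

PATTERN = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:tokens|runs)"
)


def throughput(log: str) -> dict[str, float]:
    """Return tokens/second per phase, recomputed from elapsed ms and token count."""
    rates = {}
    for phase, ms, count in PATTERN.findall(log):
        rates[phase] = int(count) / (float(ms) / 1000.0)
    return rates


rates = throughput(LOG)
for phase, rate in rates.items():
    print(f"{phase}: {rate:.2f} tokens/s")
```

Here "prompt eval" is prompt processing and "eval" is generation; the recomputed values agree with the 8.21 and 71.97 tokens per second printed in the log.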

OnanationUnderGod@reddit

lm studio, 128 GB M4 max, qwen3-30b-a3b-mlx: i got 100 and 93.6 t/s on two prompts. when i add the Qwen3 0.6B MLX draft model, it goes down to 60 t/s

https://huggingface.co/lmstudio-community/Qwen3-30B-A3B-MLX-4bit
View on Reddit #54996870
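One possible reading of why a draft model helps a dense 32B but can slow this MoE down: in the standard speculative-decoding analysis, the gain depends on how often drafted tokens are accepted and on how expensive the draft step is relative to the target step. The sketch below uses that textbook formula under simplifying assumptions: independent acceptance per token, and verifying a batch costing the same as one target step, which is optimistic for an MoE where a batch can touch more experts. All numbers are illustrative, not measurements from this thread.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target-model pass, assuming each drafted
    token is accepted independently with probability accept_rate."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def speculative_speedup(accept_rate: float, draft_len: int, cost_ratio: float) -> float:
    """Speedup over plain decoding; cost_ratio = draft step time / target step time."""
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    return tokens / (draft_len * cost_ratio + 1.0)


# Illustrative: tiny draft next to a slow dense target (cheap draft, good acceptance).
print(f"dense target:  {speculative_speedup(0.8, 4, 0.05):.2f}x")
# Illustrative: same-size draft next to a fast sparse target (relatively costly draft).
print(f"sparse target: {speculative_speedup(0.6, 4, 0.50):.2f}x")
```

A result below 1.0x means the draft model costs more time than it saves, which would be consistent with the drop reported above, though the actual acceptance rate and cost ratio there are unknown.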

WashWarm8360@reddit

How much ram it takes? I have 16GB ram and Q4 can't be loaded.
View on Reddit #55110115

Luston03@reddit

It should be like 14.7 GB
View on Reddit #55236877

Equivalent_Fuel_3447@reddit

I hate that every LLM generating responses moves text up with every line. View should stay in PLACE for god damn it, until I move it to the bottom. I can't read if it's jumping like that!
View on Reddit #55184873

fatboy93@reddit

My issue with this at the moment is that it spits a good enough summary of a document and when I ask to expand certain stuff it'll straight spit out garbage like: ********* This is on a MacBook pro M1 with 32gb ram.
View on Reddit #55129257

AlgorithmicKing@reddit (OP)

wait guys, i get 18-20 tps when i try on the pc owui is running on (instead of on my laptop which was using the.... I don't know how to say this but basically I was using owui on [192.168.0.4:8080](http://192.168.0.4:8080) on my laptop)
View on Reddit #54963389

Klutzy_Telephone468@reddit

Does it use a lot of CPU? Last I tried to run a 32b model my MacBook (64gb ram) was at constant 100% CPU usage.
View on Reddit #55021045

AlgorithmicKing@reddit (OP)

not really, but on average it's about 60%. sometimes gets to 80%
View on Reddit #55071928

Klutzy_Telephone468@reddit

Tried it again today. Started at 41% and as qwen kept thinking (this model thinks a lot) it gradually climbed to 85%, when I killed it. It was pretty fast though.

Specs: M1 Pro - 64gigs RAM
View on Reddit #55081269

uti24@reddit

But is this model good? I tried quantized version (Q6) and it's whatever, feel less good than mistral small for coding and roleplay, but faster for CPU-only.
View on Reddit #54970329

cmndr_spanky@reddit

Try regular qwen 32b for coding.. it beats everything else according my tests.
View on Reddit #55032687

ShengrenR@reddit

Make sure you follow their rather-specific set of generation params for best performance - I've not yet spent a ton of time with it, but it seemed pretty competent when I used it myself. Are you running it as a thinking model? Those code/math/etc benchmarks will specifically be with reasoning on I'm sure.
View on Reddit #55014813

Thomas-Lore@reddit

Make sure you use the recommended settings; it is pretty similar to qwq.
View on Reddit #54987984

AlgorithmicKing@reddit (OP)

in my experience, it's pretty good, but I may be wrong because i haven't used many local models (i always use gemini 2.5 pro/flash) but if mistral small looks better than it for coding then they may have faked the benchmarks.
View on Reddit #54970523

shing3232@reddit

You might need flashattention for cpu to get that back lol
View on Reddit #54968164

Thomas-Lore@reddit

I was just thinking this is way too slow for ddr5. :)
View on Reddit #54965837

Denelix@reddit

AMD CPU? 🥺 9800x3d more specifically?
View on Reddit #55000045

AlgorithmicKing@reddit (OP)

that's more powerful than mine, but you got to have at least 32 gb ram
View on Reddit #55071851

Professional_Field79@reddit

what UI are you using? looks cool.
View on Reddit #55070840

AlgorithmicKing@reddit (OP)

[Qwen3-30B-A3B runs at 12-15 tokens-per-second on CPU : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1kag4er/comment/mpsw3e7/)
View on Reddit #55071604

MHW_EvilScript@reddit

What frontend is that?
View on Reddit #55068195

AlgorithmicKing@reddit (OP)

[OpenWebUI](https://github.com/open-webui/open-webui), i am surprised you didn't know already, in my opinion it's the best ui out there.
View on Reddit #55069553

MHW_EvilScript@reddit

Thanks! I usually only fiddle with backends and architectures, but I’m really detached from real products that utilize those, that’s the life of a researcher :)
View on Reddit #55069678

cosmicr@reddit

This makes me feel ill. I'm getting only 20tk/s on my 5060 ti 16gb. Why did I waste my money? Am I doing something wrong?
View on Reddit #55056364

noage@reddit

It sounds like you are offloading from your gpu to get speeds like that.
View on Reddit #55065722

DaMindbender2000@reddit

Has anyone tested it with a 3090 so far?
View on Reddit #55022470

hexaga@reddit

Yea I get ~145 t/s gen speed with sglang, w4a16.
View on Reddit #55063391

IrisColt@reddit

Inconceivable!
View on Reddit #54961831

AlgorithmicKing@reddit (OP)

I know. Comparing it to SkyT1 flash 32b (which only got like 1 tps), it's an absolute beast
View on Reddit #54962044

skinnyjoints@reddit

Is SkyT1 a good model? I thought it was more of a demonstration that reasoning models were easy and cheap to make.
View on Reddit #55060769

cddelgado@reddit

"I do not think that word means what you think it means."
View on Reddit #54963735

Key_Papaya2972@reddit

I get 20-25 t/s with a 14700kf + 3070, all experts offloaded to CPU. The CPU easily runs at 100% and the GPU under 30%, and the prompt eval phase is slow compared to full GPU offload, but definitely faster than pure CPU. Still wondering how MoE works and where the limits lie.
View on Reddit #55059777

Brahvim@reddit

I got nearly 6 tokens a second running Gemma 3 1b q4_k_m on my PHONE last night! (CPH2083, Oppo A12, 3 GiB RAM, some PowerVR GPU that could get 700 FPS simulating like 300 cubes with a Java port of Bullet Physics in VR. Not exactly amazing these days. Doesn't even have Vulkan support yet! Phone is a *SUPER BUDGETY*, like 150 USD, from 2020. Also by the way, Android 9.)

Firefox had worse performance rendering the page than the LLM's own speed LOL. Did take nearly 135 seconds for the first message since my prompts were 800 tokens. I could bake the stuff into the LLM with some finetuning I guess. Never done that unfortunately.

(On my 2021 HP Pavilion 15 with a Ryzen 5 5600H, 16 GiB of RAM, and a 4 *GB* VRAM GTX 1650 - mobile, of course, a TU117M GPU - THAT runs this model at 40 tokens a second, and could probably go a lot faster. I did only dump like 24 layers though, funnily enough.)

Most fun part is how much this phone struggles with rendering Android apps or running more than one app in the background LOL. There barely is more than 1 *GB* of RAM ever left. And it runs a modern LLM ***fast*** (well, at least inference is fast...!).
View on Reddit #55054780

Pogo4Fufu@reddit

I also tried Qwen3-30B-A3B-Q6\_K with koboldcpp on a Mini PC with AMD Ryzen 7 PRO 5875U and 64GB RAM - CPU-only mode. It is very fast, much faster than other models I tried.
View on Reddit #54991190

Pogo4Fufu@reddit

Processing Prompt (32668 / 32668 tokens)
Generating (100 / 100 tokens)
[22:33:43] CtxLimit:32768/32768, Amt:100/100, Init:0.27s, Process:24142.02s (1.35T/s), Generate:152.68s (0.65T/s), Total:24294.70s
Benchmark Completed - v1.89 Results:
Flags: NoAVX2=False Threads=8 HighPriority=False Cublas_Args=None Tensor_Split=None BlasThreads=8 BlasBatchSize=512 FlashAttention=False KvCache=0
Backend: koboldcpp_default.so
Layers: 0
Model: Qwen3-30B-A3B-Q6_K
MaxCtx: 32768
GenAmount: 100
-----
ProcessingTime: 24142.019s
ProcessingSpeed: 1.35T/s
GenerationTime: 152.680s
GenerationSpeed: 0.65T/s
TotalTime: 24294.699s
View on Reddit #55037957

dionisioalcaraz@reddit

What are the memory specs? It's always said that token generation is constrained by memory bandwidth
View on Reddit #55034335

engineer-throwaway24@reddit

Which backend do you use, how did you set it up?
View on Reddit #55032404

lucidzfl@reddit

Would this run any faster - or more parallel with something like an AMD Ryzen Threadripper 3990X 64-Core, 128-Thread CPU?
View on Reddit #54979102

HilLiedTroopsDied@reddit

most llm engines seem to only make use of 6-12 cores from what I've observed. It's the memory bandwidth of the cpu host system that matters most. 4 channel or 8 channel or even 12 channel epyc (does threadripper pro go 12 channel?)
View on Reddit #54997025

lucidzfl@reddit

thanks for the explanation! Is there an optimal prosumer build target for this? Like threadripper 12 core - XYZ amount of ram at XYZ clock speed?
View on Reddit #55004322

HilLiedTroopsDied@reddit

Mac studio or similar with a lot of ram. Used epycs with ddr5 still expensive. epyc 9354 can do 12 channel ddr5-4800. Cheapest used.
View on Reddit #55029373
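To put the channel counts discussed here into numbers: theoretical peak DRAM bandwidth is channels times transfer rate times 8 bytes, treating each channel as 64 bits wide. A short sketch using memory configurations mentioned in this thread; these are theoretical peaks, and sustained bandwidth in practice is lower, sometimes much lower on multi-socket systems.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for 64-bit (8-byte) memory channels."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# Configurations of the kind mentioned in this thread.
CONFIGS = {
    "dual-channel DDR5-6400": (2, 6400),
    "quad-channel DDR4-2400": (4, 2400),
    "12-channel DDR5-4800": (12, 4800),
}

for name, (channels, rate) in CONFIGS.items():
    print(f"{name}: {peak_bandwidth_gb_s(channels, rate):.1f} GB/s")
```

By this estimate a 12-channel DDR5-4800 server has roughly 4.5 times the peak bandwidth of a dual-channel DDR5-6400 desktop, which is the gap that matters most for token generation.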

onewheeldoin200@reddit

I can't believe how fast it is compared to any other model of this size that I've tried. Can you imagine giving this to someone 10 years ago?
View on Reddit #55029331

ReasonablePossum_@reddit

Altman be crying in a corner. Probably gonna call Amodei and will go hand in hand to the white house to demand protection from evil china.
View on Reddit #55023471

Anada01@reddit

What about Intel iris Xe with 16 gigs of ram? Will it work?
View on Reddit #55022614

OneCuriousBrain@reddit

What is A3B in the name?
View on Reddit #54971835

Glat0s@reddit

30B-A3B = MoE with 30 billion parameters where 3 billion parameters are active (=A3B)
View on Reddit #54972978

OneCuriousBrain@reddit

Understood. Thank you bud. One more question -> does this mean that at a time, it will only load 3B parameters in memory?
View on Reddit #55000369

Zestyclose_Yak_3174@reddit

No, it needs to fit the whole model inside of your (V) RAM - it will have the speed of a 3B though.
View on Reddit #55022082
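The distinction in that answer can be made concrete: memory is set by all 30B weights, while per-token work is set by the 3B that are active. A rough sketch, using the nominal parameter counts from the model name and an assumed 6.5625 bits per weight for Q6_K (context and KV cache come on top of this):

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


TOTAL_B = 30.0        # nominal total parameters, from the model name
ACTIVE_B = 3.0        # nominal active parameters per token, from "A3B"
BPW_Q6_K = 6.5625     # assumption: bits per weight for Q6_K

resident = weights_gb(TOTAL_B, BPW_Q6_K)   # must fit in RAM/VRAM
touched = weights_gb(ACTIVE_B, BPW_Q6_K)   # read for each generated token

print(f"weights held in memory: ~{resident:.1f} GB")
print(f"weights read per token: ~{touched:.1f} GB")
```

The first figure is roughly in line with the ~26GB reported for Q6_K earlier in the thread; the second is why generation speed resembles a 3B model.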

AxelBlaze20850@reddit

I've 4070 Ti and intel i5-14kf. Which exact model version of qwen3 would efficiently work on my machine? If anyone replies, i appreciate that. Thanks.
View on Reddit #55020053

meta_voyager7@reddit

how much VRAM is required to fit it fully in gpu for practical llm applications?
View on Reddit #55015346

ghostcat@reddit

Qwen3-30B-A3B is very fast for how capable it is. I’m getting about 45 t/s on my unbinned M4 Pro Mac Mini with 64GB RAM. In my experience, it’s good all around, but not as good as GLM4-32B 0414 Q6_K at one-shotting code. That blew me away, and it even seems comparable to Claude 3.5 Sonnet, which is nuts on a local machine. The downside is that GLM4 runs at about 7-8 t/s for me, so it’s not great for iterating. Qwen3-30B-A3B is probably the best fast LLM for general use for me at this point, and I’m excited to try it with tools, but GLM4 is still the champion of impressive one-shots on a local machine, IMO.
View on Reddit #55007685

OkActive3404@reddit

Qwen rlly cooked with the qwen 3 release unlike meta with their llama 4
View on Reddit #55005470

Iory1998@reddit

u/AlgorithmicKing Remember, speed decreases as the context window gets larger. Try the speed at 32K and get back to me, please.
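
One reason for the slowdown is that every new token has to attend over all cached keys and values, so the KV cache that gets read grows linearly with context. A minimal sketch of its size follows; the architecture numbers (48 layers, 4 KV heads, head dimension 128, fp16 cache) are assumptions for Qwen3-30B-A3B and should be checked against the model's config.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 48,       # assumed layer count
    n_kv_heads: int = 4,      # assumed number of KV heads (GQA)
    head_dim: int = 128,      # assumed per-head dimension
    bytes_per_elem: int = 2,  # fp16 cache; a quantized cache would be smaller
) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    # Factor 2 = one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


for ctx in (2048, 16384, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

Under these assumptions the cache costs 96 KiB per token, or about 3 GiB at 32K, on top of the weights that are read for each generated token.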
View on Reddit #54963268

Mochila-Mochila@reddit

How to offset this? Besides faster DRAM, would more CPU cores help?
View on Reddit #54999579

myfunnyaccountname@reddit

It's insane. Running an i7-6700k, 32 GB ram and an old nvidia 1080. Running it in ollama, and it's getting 10-15 on this dinosaur.
View on Reddit #54998105

Smile_Clown@reddit

strawberry... Jesus, would you guys stop already? It's not a real test. Are you that youtuber who asks 'test' questions he doesn't know the answer to also? That said, thanks for the demo...
View on Reddit #54996380

TV4ELP@reddit

It's not a real test because enough models still get it wrong? It's a test like any other test. It's not wrong to test a known weakness. It's not the only test being done. It's one of many.
View on Reddit #54997576

FearlessZucchini3712@reddit

How does it run on Mac M1 Pro?
View on Reddit #54997082

FluffnPuff_Rebirth@reddit

Yeah, I am going low core count/high frequency threadripper for my next build. Should be able to game alright, and as a bonus I can buy as many Chinese M.2 to x16 PCIE adapters as I want to.
View on Reddit #54997073

Charming_Jello4874@reddit

Qwen excitedly pondered the epistemic question of "what is eleven" like my 16 year old daughter after a coffee and pastry.
View on Reddit #54995373

brihamedit@reddit

Is there a tutorial how to set it up?
View on Reddit #54964075

jacobpederson@reddit

Yup. `ollama run qwen3:30b-a3b` :D [https://ollama.com/library/qwen3:30b-a3b](https://ollama.com/library/qwen3:30b-a3b)
View on Reddit #54972949

brihamedit@reddit

thanks
View on Reddit #54989081

yoracale@reddit

Yes here it is: [https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)
View on Reddit #54984464

brihamedit@reddit

thanks
View on Reddit #54989043

Away_Expression_3713@reddit

Onnx available?
View on Reddit #54988506

drazzolor@reddit

How?
View on Reddit #54987833

ranakoti1@reddit

Can anyone guide me through the settings in LM Studio? I have a laptop with a 13700HX CPU, 32GB DDR5-4800 and an Nvidia 4050 with 6 GB VRAM. At default I am getting only 5 tok/sec but I feel I could get more than that.
View on Reddit #54984392

Fade78@reddit

The speed will drop with context size. This test must be done with a full context.
View on Reddit #54982482

250000mph@reddit

I run a modest system -- 1650 4GB, 32GB 3200MHz. I got 10-12 tps on Q6 after following Unsloth's guide to offload all MoE layers to CPU. All the non-MoE layers and 16k context fit inside 4GB. It's incredible, really.
View on Reddit #54975797

Eradan@reddit

Can you point me at the guide?
View on Reddit #54977112

250000mph@reddit

[here](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune) Basically add this argument to llama.cpp: `-ot ".ffn_.*_exps.=CPU"`
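
To see what that pattern selects, here is a small sketch that applies the regex part of the argument to a few tensor names. The names follow the usual GGUF naming for MoE layers but are listed here as assumptions, and the sketch assumes the pattern is applied with search semantics and that everything unmatched goes to the GPU.

```python
import re

# Regex portion of the override argument (the part before "=CPU").
PATTERN = re.compile(r".ffn_.*_exps.")

# Assumed example tensor names; real names come from the GGUF file.
tensor_names = [
    "blk.0.attn_q.weight",         # attention projection
    "blk.0.ffn_gate_inp.weight",   # MoE router
    "blk.0.ffn_gate_exps.weight",  # expert weights
    "blk.0.ffn_up_exps.weight",    # expert weights
    "blk.0.ffn_down_exps.weight",  # expert weights
    "output.weight",               # output head
]


def placement(name: str) -> str:
    """Return where a tensor would land under the override rule."""
    return "CPU" if PATTERN.search(name) else "GPU"


for name in tensor_names:
    print(f"{name:<30} -> {placement(name)}")
```

Only the large expert tensors match, so the attention weights and router stay on the GPU while the bulk of the parameters sit in system RAM.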
View on Reddit #54981898

slykethephoxenix@reddit

Is it using all cores? The AMD Ryzen 9 7950x3d has 16 cores. Pretty impressive either way.
View on Reddit #54975501

Willing_Landscape_61@reddit

Cores are usually useful for pp but tg is RAM bandwidth constrained.
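
A rough way to see the bandwidth limit on token generation: each generated token has to stream the active weights from RAM once, so bandwidth divided by bytes per token gives a ceiling. This is a sketch under assumptions: 76.8 GB/s is a hypothetical dual-channel DDR5-4800 peak (not OP's measured setup), Q6_K is taken as about 6.5625 bits per weight, and KV cache reads are ignored.

```python
def tg_upper_bound(bandwidth_gb_s: float, params_read: float, bits_per_weight: float) -> float:
    """Ceiling on tokens/s if generation is limited only by weight reads."""
    bytes_per_token = params_read * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 76.8  # hypothetical peak, real sustained bandwidth is lower
Q6_K_BPW = 6.5625      # assumed bits per weight

moe_bound = tg_upper_bound(BANDWIDTH_GB_S, 3e9, Q6_K_BPW)     # 3B active
dense_bound = tg_upper_bound(BANDWIDTH_GB_S, 30e9, Q6_K_BPW)  # dense 30B for comparison

print(f"MoE, 3B active: at most {moe_bound:.1f} t/s")
print(f"dense 30B:      at most {dense_bound:.1f} t/s")
```

Real numbers land well under the ceiling, but the tenfold gap between the two bounds matches why the MoE feels so much faster than a dense model of the same size.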
View on Reddit #54978613

HumerousGorgon8@reddit

I wish I could play around with it but the SYCL backend for Llama.CPP isn’t building RE docker image :(
View on Reddit #54977109

Rockends@reddit

One question in, this thing spit out garbage. I'll stick to 32b
View on Reddit #54976878

jay-mini@reddit

15t/s on AMD Ryzen 7 7730U + 32Gb - Q4
View on Reddit #54975017

RickyRickC137@reddit

So the benchmark says this MOE is comparable to the dense 32b model in performance? But we can run it faster and on less gpu?
View on Reddit #54973592

Key-Painting2862@reddit

Any information about how it runs on the CPU? I'd like some theory.
View on Reddit #54972510

merotatox@reddit

I wonder where OpenAI and their open-source model are after this release
View on Reddit #54972344

Roubbes@reddit

Is 3D Cache useful for inference?
View on Reddit #54972287

nodeocracy@reddit

Well played
View on Reddit #54971155

Red_Redditor_Reddit@reddit

I'm getting about the same for me. 10-14 tokens/sec on CPU only dual 3600mhz ddr4 with a i7-1185G7. 
View on Reddit #54962766

kingwhocares@reddit

That's a 4 core PC. That's pretty good.
View on Reddit #54970473

AnomalyNexus@reddit

What’s the best way to split this? Shared layers on gpu and rest on cpu
View on Reddit #54970330

Secure_Reflection409@reddit

17 t/s on my basic 32GB laptop after disabling gpu! Insane.
View on Reddit #54967728

Commercial-Celery769@reddit

I need to test on my 7800x3d
View on Reddit #54967286

Capable-Plantain-932@reddit

How fast do other models run? Is this one faster than others?
View on Reddit #54965665

metamec@reddit

The Q6 k-quant too. I was expecting Q2 or something. 😅
View on Reddit #54965469

MuchoEmpanadas@reddit

Considering you would be using llama-cpp or something similar, can you please share the commands/parameters you used. Full command will be helpful
View on Reddit #54965056

Malfun_Eddie@reddit

The power of AI in the palm of my laptop!
View on Reddit #54964861

Luston03@reddit

How much RAM is it using?
View on Reddit #54964816

logseventyseven@reddit

holy shit
View on Reddit #54961935