[-]

deanpreese@reddit

Obviously great numbers for tok/sec The real question is, "how well does it work"

[-]

Very nice for misk coding tasks. Better than Glm Flash 4.7.
x4 faster than gemma4-31b with subjectively close accuracy and instruction following. Gemma is more accurate (dense). Gemma-moe was unusable (may be fixed, but i did not check updates).
I just started qwen3.6 to test... and did not stop for a hours, continue my work with it.
For last 7 hours it generate me about 1.4-1.5 million tokens.
7900XTX+mi5032gb; -ts 1,1; 1-2 threads with 128k ctx for each.
ppt - 1200-900 t\s with low context, 700-300 t\s with 90-110k filled context. avg: 701 t\s
tg - 40-23 t\s; avg: 35 t\s
Very nice release.

[-]

CalligrapherFar7833@reddit

Cant believe anything you wrote after your 4th word "misk"

[-]

Agreeable-Yak6164@reddit

Not native language. "No speaky English" (c) =)

[-]

CalligrapherFar7833@reddit

misk , t\s , etc

[-]

Agreeable-Yak6164@reddit

So on.. Are feel better now?))

[-]

CalligrapherFar7833@reddit

Do you have some brainrot formatting issues to end your sentence with ?))

[-]

Agreeable-Yak6164@reddit

Sure, a difficult past - BBC, IRC and dial-up :-p

[-]

JustSayin_thatuknow@reddit

😆

[-]

Agreeable-Yak6164@reddit

Man, you broke my evening... now i should find and rename tons of folders on my pc with name "misk" =(

[-]

CalligrapherFar7833@reddit

haha

[-]

sammyranks@reddit (OP)

That's great.

[-]

Jungle_Llama@reddit

latest llama.cpp Vulkan, unsloth Q4 XL with a single Mi50 32GB getting 75 tok/sec (prompt varies on task, I've seen 600 tok/sec) Not bad for 280 Euros for the card. Noticable improvement in speed and accuracy over 3.5. Starting to like this model an awful lot.

[-]

michaelsoft__binbows@reddit

Thats super relevant perf out of that. Wish i picked a few up back in the day, theyre too expensive to be worth grabbing at current prices

[-]

Jungle_Llama@reddit

They have risen to 420 euro here now but this card is stting in a cheap X99 with a single E5 2680 V4. Whole rig only cost me around 440 to build, had nvme and some ddr4 ram lying around. That's insane value these days considering what Qwen 3.6 35b can do.

[-]

michaelsoft__binbows@reddit

Nice. Yeah I have some old X99 boards and I got a E5 2690 v4 for $22 about 3 years ago. a few GPUs like this would be perfect to pair that with...

[-]

ubrtnk@reddit

I get about 75t/s on 2x 5060ti with 132k context but also with cheap power draw

[-]

Eversivam@reddit

how practical are 2x 5060ti ? I've been looking for people testing them but haven't found anything.

[-]

noreallyhughjackman@reddit

I think in the scheme of things theyre a little underpowered, I feel like 5070ti is in a better place (or 3080 20gb from china). But plenty good for playing around with larger context.

I was going to make a 4x5060ti 16gb system but once you get to models that size, theyre far too slow to run at usable speed (aside from batch/non realtime scenarios) so might be better to spend more on fast gpu than just sheer memory size.

[-]

ubrtnk@reddit

I normally would agree. Would have bought another 4080 or a 5070 but only had one 8-pin power to go with the last gpu that I was putting on the last 4 pcie lanes I had without bifurcation. Another 5060 made the most sense. I thought about going with an older volta or Turing card but didn't want to deal with the lack of some features

[-]

ubrtnk@reddit

This comment aged poorly in the last 2 days. I ended up returning the 5060 and found another 4080 for damn near the same price on marketplace. So as soon as my converter cable comes in I'll have qwen on 2x 4080s

[-]

ubrtnk@reddit

I originally got the first one to run the small always on models I needed like embedding, rerank and a task model for owui. Plenty fast for those models. It can also run gpt-oss:20 at like 100t/s. It was a good card to suppliment the 3090s at the time. After I got my 5090 for my gaming rig, moved the 4080 in and ran that with the 5060 in a mismatched pair. Now I'm at 2x4090s. 2x 3090s, the 4080 and the 5060 so got the second 5060 for a matched pair for always on qwen and playing around with tensor parallelism as a matched set. I got the pny version so it's also small and only takes up one 8-pin power, which was all I had left on my second psu.

[-]

Eversivam@reddit

if a model fits only 1 5060ti, will the other 5060ti help with extending the context ?

[-]

ubrtnk@reddit

Kinda. The weights split based on how you tell it to and the rest fills with context but both model and context live on both gpus.

[-]

ubrtnk@reddit

As you can see here, the context doenst split evenly. I could do my -ts split in llama.cpp like 4,5 or something similar to shift more of the weights to the 2nd 5060 if I reall wanted them to be balanced.

Here's a llama-bench example evenly split with a 2048 prompt:

CUDA_VISIBLE_DEVICES=5,6 ./llama-bench -m /mnt/storage/ai-storage/models/vision/Qwen/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -ngl 999 -fa 1 -ts 1/1  -ctv q4_0 -ctk q4_0 -mmp 0 -p 2048 ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31695 MiB):   Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15847 MiB   Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15847 MiB | model                          |       size |     params | backend    | ngl | type_k | type_v | fa | ts           | mmap |            test |                  t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | ------------ | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | CUDA       | 999 |   q4_0 |   q4_0 |  1 | 1.00/1.00    |    0 |          pp2048 |       3337.99 ± 4.11 | | qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | CUDA       | 999 |   q4_0 |   q4_0 |  1 | 1.00/1.00    |    0 |           tg128 |         91.44 ± 0.53 |

Sorry for the formatting, can only upload one image

[-]

LiquidNeat@reddit

95 t/s on my Macbook Pro M5 with MLX, 60W power draw.

[-]

s1rlight@reddit

Stupid question, what software is the one in the screenshot?

[-]

Available-Craft-5795@reddit

Increese that temp a lil

[-]

sammyranks@reddit (OP)

0.1 was the default. Will try 0.5

[-]

OniCr0w@reddit

From the official Qwen authors:

Thinking mode (default):

General: temperature=1.0, top_p=0.95, top_k=20, min_p=0, presence_penalty=1.5
Coding/precise tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0, presence_penalty=0

Non-thinking mode:

General: temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5
Reasoning tasks: temperature=1.0, top_p=1.0, top_k=40, min_p=0, presence_penalty=2.0

Important:

Keep at least 128K context to preserve thinking capabilities
Use --jinja flag with llama.cpp for proper chat template handling
Vision support requires the mmproj file alongside the main GGUF

Usage

Works with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF-compatible runtimes.

# Text only
llama-cli -m Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  --jinja -c 131072 -ngl 99

# With vision
llama-cli -m Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf \
  --jinja -c 131072 -ngl 99From the official Qwen authors:Thinking mode (default):General: temperature=1.0, top_p=0.95, top_k=20, min_p=0, presence_penalty=1.5
Coding/precise tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0, presence_penalty=0Non-thinking mode:General: temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5
Reasoning tasks: temperature=1.0, top_p=1.0, top_k=40, min_p=0, presence_penalty=2.0Important:Keep at least 128K context to preserve thinking capabilities
Use --jinja flag with llama.cpp for proper chat template handling
Vision support requires the mmproj file alongside the main GGUFUsageWorks with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF-compatible runtimes.# Text only
llama-cli -m Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  --jinja -c 131072 -ngl 99

# With vision
llama-cli -m Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf \
  --jinja -c 131072 -ngl 99

[-]

narrowbuys@reddit

Source ?

[-]

OniCr0w@reddit

It's in the HauHauCS Uncensored versions descriptions. That's where I got it at least but he claims it's from the Qwen team and I would trust him.

[-]

Octopotree@reddit

Does it's thinking take up context space? I guess I assumed the thinking would be discarded after answering

[-]

Fun_Librarian_7699@reddit

You can choose if you want to include thinking in the context.

[-]

OniCr0w@reddit

The uncensored HauHauCS versions have recommended settings for thinking and non-thinking modes in the LLM descriptions on Hugging Face.

[-]

Glittering-Call8746@reddit

Which quant ? Unsloth ?

[-]

sammyranks@reddit (OP)

Yes Unsloth, Q5 K S.

[-]

Odd_Butterfly_455@reddit

I just got my hand on 2 Radeon r9700 pro AI I plug one today waiting for my 1200 watts power supply come next week I will post some benchmark

[-]

Evildude42@reddit

So last night, I tried the sloth version of this at 4K4K, large 5K 5K, large and 6K. That’s being split between a B50 and a b580. Obviously, the smaller ones didn’t fit in the combined memory, and the larger one had to spill over in the ram, and I didn’t notice that the 4K version was about twice as fast as the six K version something like 35 tokens versus 15 tokens per second. Temperature was .6 but every single one of them crashed. The first few sample questions it went through fine by the time I went through for my third round of questions Il mk Studio just gave up the ghost and the model crashed. Today I got the newer LM studio versions, there was no 5K so I got the 4, 6 and the eighth. They all ran slower, but none of them crashed. By the way, I’m running in windows because I can’t get undo due to work or I couldn’t get it to work with the beta.

[-]

ComfyUser48@reddit

I am getting 166 tok / sec with my 5090 (limited to 80% power), with Q5_M, 210k context, running on llama.cpp

[-]

Dany0@reddit

Did you OC memory?

[-]

sammyranks@reddit (OP)

nice

[-]

bb943bfc39dae@reddit

Have you tried NVFP4 quant? Seems a waste not to leverage the Blackwell architecture

[-]

Subject_Mix_8339@reddit

NVFP4 quant

where is this quant?

[-]

79215185-1feb-44c6@reddit

I still get around 130t/s with my 2x7900XTX. Nothing has really changed for me.

[-]

soyalemujica@reddit

What are you doing to get such nice token per second? I have a single 7900xtx and I am stuck in 65t/s which drops considerably to 40t/s at 8k context

[-]

79215185-1feb-44c6@reddit

You sir are out of VRAM.

[-]

soyalemujica@reddit

Sorry, I fixed it, now I get 170t/s with 131k context, I think I was doing wrong configuration.

[-]

MikeLPU@reddit

Fixed how?

[-]

soyalemujica@reddit

Enabled iGPU, put Discord, MS Edge, and other programs that I don't need to be on my main GPU, so my GPU vram is empty

[-]

MikeLPU@reddit

On windows? On linux with threadripper and full available 7900xtx i'm getting 90t/s max.

[-]

soyalemujica@reddit

Yeah, on Windows, I get 160t/s, as the context fills up it lowers ofc

[-]

ju9io@reddit

170t/s on a single 7900 xtx?

[-]

soyalemujica@reddit

Yeah, averaging 155t/s

[-]

MikeLPU@reddit

I have too and never saw these numbers. What the trick?

[-]

sammyranks@reddit (OP)

Great.

[-]

darkgamer_nw@reddit

What chat software is the one in the picture?

[-]

Abysseer@reddit

Looks like LMStudio

[-]

ego100trique@reddit

Win + Shift + S

[-]

retireb435@reddit

Do you think Q5 S is better than Q4 XL?

[-]

GregoryfromtheHood@reddit

With llama.cpp on Ubuntu I was getting 10k pp and 200-250 t/s from some quick tests on my 5090 without optimising anything yet. You using linux or windows?

[-]

Adventurous_Farm3073@reddit

I get around 120t/s on my dual5070 ti+ 5060ti system. My Dual 5090 system gets ~180. Q8 is close to 80.

[-]

Rare_Potential_1323@reddit

Is the 5060ti for pre processing? I was going to setup my old 1070ti for that soon

[-]

swiss_aspie@reddit

How does that work ? You can use a smaller gpu to speed things up?

[-]

KvAk_AKPlaysYT@reddit

I need a 5090.

Lmk if anyone has an extra one

[-]

loveisnomorethandust@reddit

i don't even have a dedicated gpu. i need a 5090 more than this guy. give me one before him.

[-]

KvAk_AKPlaysYT@reddit

Let's split it.

We both get 16GB.

[-]

FrostTactics@reddit

I get the mental image of the two of you standing before u/sammyranks as King David. By suggesting to split the GPU, you reveal that you are not its true owner.

[-]

LostDog_88@reddit

I want one too!

although, i want to retain my GBs, lets split the 5090 to 2545s!

[-]