QwQ-32B seems to get the same quality final answer as R1 while reasoning much more concisely and efficiently

tengo_harambe@reddit

Reasoning models are sensitive to sampler settings and quantization, so any anecdote really should include this information.

Nice_Grapefruit_7850@reddit

Using q3km vs q8 QwQ, I noticed a lot fewer Chinese characters in its thoughts and output.

blacktothafuture@reddit

Are there any papers that talk about this? I’ve had this hunch using different ones locally but would love to read more

>Reasoning models are sensitive to sampler settings also important if you want to test qwq against r1. you of course have to use the recommended qwq settings and not your default whatever setting, then you'll realize it actually reaches the regions it claims to reach.

VoidAlchemy@reddit

Agreed, I used the sampling info suggested on the original model card, no system prompt, and bartowski's `IQ4_XS` with `q8_0` kv-cache in [this quick flappy birds demo](https://www.reddit.com/r/LocalLLaMA/comments/1j4h5s5/qwq32b_flappy_bird_demo_bartowski_iq4_xs_32k/) video so you can see results for yourself. The jury is still out, but given over 30 tok/sec with `32k` context it's a top choice for the under ~70B class models.

outworlder@reddit

"Reasoning more concisely"? That's not what I see. It overthinks like a person with anxiety who just took a red bull.

Conscious_Cut_6144@reddit

Definitely overthinking for me too.

Usual_Maximum7673@reddit

I think the overthinking happens when you use a quantized version. The non-quantized doesn't overthink - and doesn't go into a loop.

Devonance@reddit

I'm confused, OP, please tell me how this is concise thinking? It thinks more than DeepSeek R1 (real one) and Claude 3.7 (reasoning)... am I just using it wrong? I see so many praising it, it is good I agree, but it is not better than qwen coder 2.5 32b q4 in my experience for answer/time outcome. Please, I'd love to get a better model than just qwen coder2.5 32b.

Actual-Lecture-1556@reddit

*...but wait, ...*

expertsage@reddit

In my testing it performs really well for a 32B model, but is still far from R1's creativity or "smartness". Mostly testing on writing tasks.

Caffeine_Monster@reddit

It's significantly worse than R1 in any real world scenario - and not just writing tasks. It's still a very impressive model, but it's nowhere near R1. Likely another case of overfitting against the style of tasks used in the benchmarks.

Pyros-SD-Models@reddit

Since it comes close to R1 in our three internal benchmarks, you surely have some examples where QwQ is "significantly worse"? Would love to build a new benchmark.

power97992@reddit

What really matters: is the model better than or equal to R1 at coding and math?

Caffeine_Monster@reddit

Depends on the use case. For general chat I test heavily for reasoning and recall over niche knowledge. Model size does tend to have a noticeable impact. To put it simply: given a complex problem, QwQ makes a lot more mistakes than R1.

waywardspooky@reddit

damn was just about to check to see how it's performing in creative writing benchmarks

Foreign-Beginning-49@reddit

See how it does with your style of writing. There is no reliable benchmark for an activity as subjective as this. Best wishes.

ViperAMD@reddit

No matter what prompt or model is used, gptzero.me will detect it. Not a big deal, but no bot is truly creative in terms of unique content.

AppearanceHeavy6724@reddit

This is a bad take; you should never use the output of a model straight, without editing.

Outrageous-Wait-8895@reddit

Why is beating gptzero important to you?

ThePixelHunter@reddit

And R1 goes undetected?

KeyTruth5326@reddit

Same. Given its parameter count, QwQ truly performs well but still cannot catch up with those massive models.

Hoodfu@reddit

"More efficient". On the q8 with ollama, I asked it "How many wheels on a typical bus?". 29 paragraphs later: A typical standard bus, such as a city transit or school bus, generally has \*\*four wheels\*\*. These are arranged on two axles: one at the front and one at the rear. However, larger buses (e.g., double-decker or long-distance coaches) may have additional axles for stability, leading to more wheels. Articulated buses (with a flexible joint connecting two sections) often have \*\*six wheels\*\* due to their extended length and multiple axles. Still, for most common urban or school buses, the standard configuration is \*\*four wheels\*\*. The count can sometimes be confused with "dual tires" on the same axle (common in heavy vehicles), but these are still counted as single wheels with additional tires for load support.

markole@reddit

The issue might be with Ollama and not the new QwQ itself: https://github.com/ollama/ollama/issues/9530

snowcountry556@reddit

I really hope this is it, as I've not been at all impressed with the Ollama 4_K_M version of QWQ, as it just gets stuck thinking to itself for ages, and then outputs mediocre results, even with the recommended settings.

markole@reddit

Same here, I'm getting better experience with the mistral-small:24b honestly.

Dogeboja@reddit

Why is Ollama always like this. Rushing to get the model out and giving people false first impressions!

Hoodfu@reddit

I'd have to agree. If the config is wrong on this one, it's probably the 5th time I've downloaded a model again a week later and it was way better.

nuclearbananana@reddit

I just tried on the website and got 39 paragraphs. Although many of them weren't more than a sentence. Still a lot of overthinking

No-Mountain3817@reddit

Following changes to ollama Modelfile fixed the problems for me:

`SYSTEM You are a helpful and harmless assistant.`
`PARAMETER stop <|im_start|>`
`PARAMETER stop <|im_end|>`
`PARAMETER num_ctx 16384`
`PARAMETER repeat_penalty 1.0`
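If you'd rather not rebuild the model from a Modelfile, the same values can probably be passed per request instead. A minimal sketch, assuming Ollama's `/api/generate` endpoint and a local model tag of `qwq` (the tag is an assumption, adjust it to whatever you pulled); it only builds the request body and leaves sending it to you:

```python
import json


def build_ollama_request(prompt, model="qwq"):
    """Build a JSON body for Ollama's /api/generate endpoint.

    The values mirror the Modelfile lines above; the model tag is an assumption.
    """
    return {
        "model": model,
        "system": "You are a helpful and harmless assistant.",
        "prompt": prompt,
        "stream": False,
        "options": {
            "stop": ["<|im_start|>", "<|im_end|>"],
            "num_ctx": 16384,
            "repeat_penalty": 1.0,
        },
    }


if __name__ == "__main__":
    body = build_ollama_request("How many wheels on a typical bus?")
    print(json.dumps(body, indent=2))
```

POST the result to `http://localhost:11434/api/generate` (Ollama's default port) with any HTTP client. Per-request options avoid keeping a separate model copy around, at the cost of having to send them every time.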

No-Mountain3817@reddit

https://preview.redd.it/l2r04wpzy0ne1.png?width=1746&format=png&auto=webp&s=52c91fef7d310b44ce1b64d9fe3153d471e8e0ee

108er@reddit

I had to lower the temperature to 0.3, or else it was giving out all sorts of nonsense in the response.

pigeon57434@reddit (OP)

you shouldn't set it that low. here, follow these settings for optimal results: [https://www.reddit.com/r/LocalLLaMA/comments/1j5qo7q/qwq32b_infinite_generations_fixes_best_practices/](https://www.reddit.com/r/LocalLLaMA/comments/1j5qo7q/qwq32b_infinite_generations_fixes_best_practices/)

DrDisintegrator@reddit

I'm trying to use it on my Ollama setup on my home PC and it just spins forever even on the simplest problems. Not sure what settings I'm missing, but clearly there is something that needs to be done to get the type of performance you are talking about.

pigeon57434@reddit (OP)

here, use the settings in this post: [https://www.reddit.com/r/LocalLLaMA/comments/1j5qo7q/qwq32b_infinite_generations_fixes_best_practices/](https://www.reddit.com/r/LocalLLaMA/comments/1j5qo7q/qwq32b_infinite_generations_fixes_best_practices/)

DrDisintegrator@reddit

Thanks!

xor_2@reddit

I haven't really played that much with 671B Deepseek-R1 and mostly played with 32B qwen and 70B llama distills, and these didn't seem to think nearly as much as QwQ does. I did check my deepseek chat history where I tested it and... imho QwQ thinks much more. Whether it's concise or not I am not sure. When the CoT is too long no one is gonna read it...

custodiam99@reddit

It uses Chinese characters in English replies randomly and that sucks.

GaragePersonal5997@reddit

I think this is a common problem with models that have a low parameter count.

custodiam99@reddit

I instructed the model via the system prompt to always avoid and translate any Chinese symbols, and the problem disappeared.
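A quick way to check whether an instruction like that actually helps is to count the Chinese characters in the output before and after. A rough sketch, assuming only the main CJK Unified Ideographs block matters (the function names are made up for illustration):

```python
import re

# Main CJK Unified Ideographs block only; extension blocks are ignored here.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def count_cjk(text):
    """Number of CJK ideographs found in the text."""
    return len(CJK_PATTERN.findall(text))


def cjk_ratio(text):
    """Share of characters that are CJK ideographs (0.0 for empty text)."""
    return count_cjk(text) / len(text) if text else 0.0


if __name__ == "__main__":
    sample = "A typical bus has four 轮子 in total."
    print(count_cjk(sample), round(cjk_ratio(sample), 3))
```

Running this over a batch of replies with and without the system prompt gives a number to compare instead of an impression.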

Cergorach@reddit

I tried it for creative writing, it uses a LOT of thinking. That seems better than the r1 70b model. The output is also more verbose, but lacks the... Creativity? Emotion evoking? Conciseness? of r1 671b. I need to play around a bit more, but it's already impressive that at certain points it's better than a model twice its size. Expecting it to beat a model 10x its size might be a bit of hubris...

Dangerous_Fix_5526@reddit

This is from quant IQ3\_S: The skyscraper shuddered, its steel bones groaning like a dying titan as it lurched beneath me. I gripped the edge of my desk, nails biting into Formica, as fluorescent lights flickered and buzzed overhead. The floor tilted sharply, and for a heartbeat, I thought we might all plunge into some hellish freefall. Then came the sound: a low, resonant *crack*, like the universe itself splitting at its seams. The window blew inward with a roar that drowned out everything else—the screams, the shrapnel of glass, the wet *thud* as something massive and wrong slammed into my coworker’s desk beside me. Air rushed through the jagged hole where the pane once stood, carrying the stench of burnt hair and copper blood. I stared, paralyzed, as a figure emerged from the chaos, its body mangled and misshapen, limbs contorted like broken twigs left in a road. Its face was a ruin—a crumpled mask of skin and sinew, eyes glowing an unnatural viridian through the gaps. But it wasn’t the creature that froze me; it was *her*—or what remained. (and it picks up from there... over 1500 tokens output AFTER 1500 tokens of "thought") ... this model ROCKS.

Gyramuur@reddit

That's QwQ 32B?

Dangerous_Fix_5526@reddit

Hell yeah ... ; just getting the model warmed up. I stopped the post there because it gets a lot more graphic... and intense.

Gyramuur@reddit

That honestly looks pretty stellar, immensely descriptive, not clinical in the way that LLMs usually are. What kind of sampler/temp settings are you using?

Dangerous_Fix_5526@reddit

Temp .8, top k: 40, rep pen 1.1 (model seems to like higher rep pen??), top p: .95, min p: .05; no other samplers/parameters. The model is using correct sentence structure for the "scene", dialog is spot on, "sound words" are being used correctly, and the model's "planning for the scene" is stellar. The model checks all the boxes AND understands crafting a scene like this properly. I have also tested this model with riddles, and science too - top marks. Makes the "other guys" look like they are standing still.
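For readers unfamiliar with these knobs, here is a rough pure-Python sketch of what temperature, top-k, top-p and min-p do to a next-token distribution. The order of the steps is an assumption (backends differ and often let you reorder them), repetition penalty is left out, and the logits are made up:

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(logits, temperature=0.8, top_k=40, top_p=0.95, min_p=0.05):
    """Return {token_id: probability} after temperature, top-k, top-p and min-p."""
    probs = softmax(logits, temperature)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    ranked = ranked[:top_k]  # top-k: keep the k most likely tokens

    kept, cumulative = [], 0.0
    for token_id, p in ranked:  # top-p: smallest prefix reaching the mass
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break

    threshold = min_p * kept[0][1]  # min-p: relative to the best token
    kept = [(t, p) for t, p in kept if p >= threshold]

    total = sum(p for _, p in kept)
    return {t: p / total for t, p in kept}  # renormalize the survivors


def sample(logits, rng=random, **settings):
    """Draw one token id from the filtered distribution."""
    candidates = filter_candidates(logits, **settings)
    tokens = list(candidates)
    weights = [candidates[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [5.0, 4.0, 1.0, -3.0]  # made-up values
    print(filter_candidates(toy_logits))
    print(sample(toy_logits, rng=random.Random(0)))
```

With these settings the long tail of unlikely tokens is cut before sampling, which is part of why the same model can feel very different under a client's defaults.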

Gyramuur@reddit

Yeah I really want to try it, but I can't, because it just outputs its reasoning as the response and will NEVER stop reasoning. I tried the version I linked you but I also tried the bartowski version and I tried the version from Qwen themselves and they ALL just reason and will never write anything else because they're too busy "thinking".

Dangerous_Fix_5526@reddit

Try it with different prompts. I found riddles can hit 4-12k tokens of thinking, whereas more direct "action" prompts run 1-2k, 3k at the outside, THEN OUTPUT.

Gyramuur@reddit

All right so I grabbed it from here which is the first thing that pops up when searching it in LM Studio: [https://huggingface.co/lmstudio-community/QwQ-32B-GGUF](https://huggingface.co/lmstudio-community/QwQ-32B-GGUF) I got the 4_K_M version. But for some reason it's only generating its reasoning as responses, and they're not inside separate reasoning blocks either; the thinking is getting outputted as the main response.

Gyramuur@reddit

Thanks! I'll play around with it a bit and report back

martinerous@reddit

I wonder how it works if you short-circuit thinking by sending it prefilled text with `<think>Ok, I will write something very creative now.</think>`? (Note: short-circuiting usually works only when running locally and having control over the template used during generation, or when using text completion instead of chat completion.)
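A minimal sketch of that prefill trick, assuming the ChatML-style markers Qwen models use (`<|im_start|>` / `<|im_end|>`) and a text-completion endpoint that accepts a raw prompt. The exact whitespace the official template puts around `<think>` is an assumption here, so compare against the model's chat template before relying on it:

```python
def build_prefilled_prompt(user_message, canned_thought):
    """Build a raw ChatML prompt whose assistant turn already holds a closed think block.

    The model then continues after </think>, i.e. straight into the answer.
    Whitespace around the think tags is an assumption, not the official template.
    """
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"<think>{canned_thought}</think>\n\n"
    )


if __name__ == "__main__":
    prompt = build_prefilled_prompt(
        "Write a short scene set in a collapsing skyscraper.",
        "Ok, I will write something very creative now.",
    )
    print(prompt)
```

Send the string to a plain completion endpoint rather than a chat endpoint, since chat endpoints apply their own template and would wrap the prefill in a fresh turn.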

KBMR@reddit

Can you share more what you mean? I've kind of never been able to get AI to evoke emotion. It's always so dry.

Cergorach@reddit

r1 671b:

Here's a vivid read-aloud text for the room at the bottom of the well, designed to hint at secrets and spark curiosity:

Read-Aloud Text:

*“The dim light from above barely pierces the gloom of this cramped, square chamber. A thin layer of grit grinds beneath your boots—sand, perhaps, or something far older. The walls are studded with dented, rust-caked shields that hang crookedly, their edges flaking with rot. Between them, crude graffiti mars the stone: faded symbols, tally marks, and the occasional word like ‘BEWARE’ or ‘LIES.’ To the south, a low tunnel slopes downward before twisting west into darkness. The air smells of damp iron and whispers of countless adventurers who passed through… and what they might have left behind.”*

DM Notes (Hidden Secrets):

* Secret Door (North): A successful DC 15 Perception check reveals faint scratches near a shield that doesn’t quite sit flush with the wall. The door opens north into a narrow passage but cannot be reopened from the other side.
* Sand Clue: The “sand” is actually powdered bone (a Religion or Arcana check DC 12 reveals this), hinting at the Well’s deadly history.
* Graffiti: One scrawled phrase—*“He watches”*—glows faintly under Detect Magic (a residual divination spell from a paranoid explorer).

This balances atmosphere, subtle clues, and opportunities for exploration! Let me know if you’d like tweaks. 😊

==================

I've left out the large amount of thinking and just pasted in the results. QwQ32B seems like it's trying to write a novel, r1 671b understands far better what we're trying to achieve. We're playing a game where you need a relatively short description that sets the atmosphere and gets people stoked to explore further. Emotional might not be the right word, evocative might fit better? I did another one and the structure QwQ32B uses feels very similar.

Cergorach@reddit

For a pnp RPG (D&D) I ask a couple of questions to get the LLM primed and then give a room description that needs to be 'translated' to something that needs to be read aloud.

=====================

QwQ32B:

Here’s a dramatic read-aloud description of the room at the bottom of the well, tailored for atmosphere and intrigue:

---

**Read-Aloud Text:**

*The rope creaks as you descend into the gloom, finally landing with a soft thud on a floor that shifts beneath your boots. The air is still, thick with dust, and the faintest glimmer of light from above dwindles to nothing. Before you stretches a vast, shadowy chamber—40 feet in every direction—a place forgotten even by time itself.*

*The walls are lined with dented shields, their edges rusted and pocked with age. Each bears the scars of battles long past: chipped wood, twisted metal, and faded sigils that whisper of forgotten causes. Between them, the stones are daubed with graffiti—names scratched in haste, crude symbols, and one phrase scrawled over and over in frantic strokes:* **“GET OUT.”**

*The floor beneath your feet is a thin layer of sand, gritty and unnervingly clean, as though no foot has disturbed it for years… or decades. A narrow tunnel leads southward from the northwest corner, bending sharply west after a few paces—a route that seems to vanish into the blackness.*

*A faint draft stirs, and you notice something strange: all eyes wander instinctively toward the southern exit, yet your gut twists with unease. The north wall feels… off. Flat. Smooth. Too perfect for this crumbling place. A secret door? But why would anyone hide a way out here? And if it’s sealed from the other side, does that mean something—or someone—is waiting above?*

---

### Key Elements Highlighted:

- **Mystery of Exits**: Emphasizes the hidden north door and the ominous south passage.
- **Graffiti as Clues**: The recurring *"GET OUT"* hints at prior danger or madness.
- **Ambiance**: Words like *"forgotten by time"* and *"unnervingly clean sand"* build unease.

Would you like adjustments (e.g., tone, emphasis on certain features)? Let me know!

AppearanceHeavy6724@reddit

I used to dislike R1, but I was given advice to run it at T=0.2 and it is much better now. I agree R1 has more sophistication and complexity in its prose.

waywardspooky@reddit

lacking creativity in terms of prose quality or story beats?

Lissanro@reddit

I am still downloading QwQ-32B, but based on what others have reported so far, I can imagine it being on par with or even a bit better than the 70B distill, but nowhere close to the actual R1 671B in real-world tasks. But far fewer parameters mean much higher speed, so I expect it to be a good addition to my toolbox.

ASYMT0TIC@reddit

It doesn't technically mean higher speed, since the active parameters are almost the same. It definitely means it's easier to run it locally.
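Some back-of-the-envelope arithmetic for this point. R1 is a mixture-of-experts model with 671B total parameters but about 37B active per token (that figure comes from DeepSeek's model card, not from this thread), while QwQ is a dense 32B. The sketch below estimates weight memory only, ignoring KV cache and runtime overhead, and the 4 bits per weight is an illustrative assumption:

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    bits = 4  # illustrative quantization level, not a recommendation
    models = {
        "QwQ-32B (dense, 32B active)": 32,
        "R1 671B (MoE, ~37B active)": 671,
    }
    for name, params in models.items():
        print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB of weights")
```

Per-token compute tracks the active parameters (32B vs about 37B), so speed is in the same ballpark when everything fits in fast memory, but all 671B weights still have to be stored somewhere, which is why QwQ is so much easier to run locally.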

frivolousfidget@reddit

I tried a code prompt locally and it failed miserably :/

Sadman782@reddit

What about in their website? Quantization issue?

Interesting8547@reddit

I think it might be also a template issue. Try the one from Huggingface and compare it with your local. [https://huggingface.co/spaces/Qwen/QwQ-32B-Demo](https://huggingface.co/spaces/Qwen/QwQ-32B-Demo)

frivolousfidget@reddit

Ok. Did one more run local and 3 more on fireworks.

Fireworks runs: The first two at fireworks were as bad as my local run with default settings until I lowered the temperature. The successful firework run was at temp 0.4, top-p 0.0, playable game, everything working.

Locally: My local run (MLX self-quantized Q6) used temp 0.2 and top-p 0.8, which is my standard for local code generation on Qwen 2.5 coder models. I just finished running it locally and the result now with lower temperature and high top-p is perfectly playable; the only bug is that the “Best score” feature doesn’t work, everything else works flawlessly. Note that token count is very high, around 15k output tokens, mostly CoT.

I assume that the default settings for the clients had very high temperature which was messing up the code generation.

TLDR; Be sure to set lower temperatures for coding.

The local run: https://pastebin.com/2ADYk5zw
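Why lower temperature plausibly helps here: dividing the logits by a smaller temperature concentrates probability on the most likely token, so code generation wanders less. A toy sketch with made-up logits (not taken from any real model):

```python
import math


def top_token_probability(logits, temperature):
    """Probability of the most likely token after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    return max(exps) / sum(exps)


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5]  # made-up values for illustration
    for temperature in (1.0, 0.6, 0.4, 0.2):
        p = top_token_probability(toy_logits, temperature)
        print(f"T={temperature}: top token probability {p:.3f}")
```

The probability of the top token rises steadily as the temperature drops, which matches the experience above that the same model goes from broken output to a playable game once the temperature is lowered.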

knownboyofno@reddit

Did you check the model card? [https://huggingface.co/Qwen/QwQ-32B#usage-guidelines](https://huggingface.co/Qwen/QwQ-32B#usage-guidelines) Here are the suggested settings: **Sampling Parameters**:

frivolousfidget@reddit

I probably read it before my morning coffee, I read the README twice and somehow I missed it in both occasions. Thanks for sharing

JohnnyLiverman@reddit

hes just a lil guy cut him some slack

AdOdd4004@reddit

This happens to mine too, the snake game code does not run... RIP.

Ichihara02@reddit

How can i access it? Is the one they mentioned on X the same with the one named QwQ 32b preview?

pigeon57434@reddit (OP)

no, in order to access it on their website you select qwen-2.5-plus and then thinking (QwQ). alternatively it's on a huggingface demo here: [https://huggingface.co/spaces/Qwen/QwQ-32B-Demo](https://huggingface.co/spaces/Qwen/QwQ-32B-Demo)

Ichihara02@reddit

Thanks for the info! They just added it on the actual qwen site, they added the full QwQ 32B model.

Xandrmoro@reddit

Tell it to my openrouter budget. I tried to use it there, and it kept rambling longer than 32b r1, 70b r1 and full r1 _combined_

xpnrt@reddit

want to try it with koboldcpp but the gguf files this time are separated into 4gb parts. Would that work with kobold? I don't want to download 20gb if it won't.

Zayaraq@reddit

I am honestly not having a lot of luck with it. I tried it a bit via openrouter and it routinely used close to 10k tokens for reasoning, failed simple stuff like the strawberry test and often just stopped working. Are there some settings that I'm missing? Haven't tested it locally yet, because I only have 20gb of VRAM and from what I've seen it might run slow, but I'll try it later.

AnticitizenPrime@reddit

I'm having trouble with Openrouter too. It keeps devolving into gibberish and Chinese. It does work locally however (though it took 35 minutes to answer one question at 2 tok/sec with the Q4 quant on my 4060ti with offloading). I have the same temp, top K, etc configured for both Openrouter and local... I think something's funky with the OR config, maybe context set too low, given that it reduces to gibberish about halfway through generation?

JacketHistorical2321@reddit

I really wish that there was some way to prevent the community from seeing any sort of data in a chart form or otherwise until a week or so goes by for real world testing just in terms of interaction. I know it's kind of a pipe dream but I feel like there's a bit of a confirmation bias that goes on when new models immediately release like this. I'm not saying you're incorrect I'm just saying that model interaction is too subjective

pigeon57434@reddit (OP)

Well here are 2 things that are 100% objective:

1. both QwQ and R1 got the correct answers in all my testing
2. QwQ on average generated around 4x fewer characters in its chain of thought

those are pretty objective, no?
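For anyone who wants to reproduce the chain-of-thought length comparison, a small sketch follows. It assumes you have saved the raw responses with the reasoning wrapped in `<think>...</think>` tags, which is how local runs usually emit it; web UIs may present it differently. Function names are hypothetical:

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def reasoning_chars(response):
    """Total characters inside all <think>...</think> blocks of one response."""
    return sum(len(block.strip()) for block in THINK_BLOCK.findall(response))


def mean_reasoning_chars(responses):
    """Average reasoning length over a list of responses (0.0 if empty)."""
    lengths = [reasoning_chars(r) for r in responses]
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    # Placeholder transcripts; replace with real saved outputs per model.
    model_a = ["<think>short plan</think>Answer: 42"]
    model_b = ["<think>a much longer and more winding plan</think>Answer: 42"]
    print(mean_reasoning_chars(model_a), mean_reasoning_chars(model_b))
```

Reporting the prompts and the per-prompt numbers alongside the average makes it easier for others to see whether the ratio holds on their own questions.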

Zayaraq@reddit

they are objective, but the sample size is a bit small to really be significant. for me it underperformed in all my standard testing prompts. I just downloaded it in LM Studio and I did the strawberry test. It gave the wrong answer 9 times and corrected itself again and again. And it did that in the output, not while reasoning. On openrouter it used over 10,000 tokens for this a few times. qwen 7b consistently only takes a few hundred for this. It's also slower than R1 qwen 32b. I do have to stress that I haven't played around with different settings for the model so my experience might not be universal.

pigeon57434@reddit (OP)

please stop using the damn strawberry test, it's not a good measure of intelligence or usefulness at all

Zayaraq@reddit

I mean you're right. However it is useful to check where about a model is at. I rarely see models fail this test anymore so if one does (or uses 10k to succeed) it is suspicious to me. I did use other programming tests that I do on all my models and unfortunately it underperformed in other cases as well. I did use the recommended settings now, which improved it somewhat, but I still get better results with other similar models. I will keep playing around with it tho.

JacketHistorical2321@reddit

What were your questions?

Evening_Ad6637@reddit

Nope, it’s not necessarily objective, because it depends on your questions.

pigeon57434@reddit (OP)

well no, because in my tests they are objective. i never claimed they're universally objective. there is local vs global objectiveness, and both are still objective. the fact that maybe you get questions which differ in results doesn't mean that my results are somehow magically subjective opinion

ihaag@reddit

I think Deepseek is still much better, it doesn’t get caught in a loop anymore, whereas I found QwQ does get caught. I think Claude is still better than both, but Deepseek is still the winner for open source.

pigeon57434@reddit (OP)

yes i agree full R1 is obviously way better at most things especially creative writing and stuff like that but in math which is what i was testing it on here I found them to be pretty comparable

ihaag@reddit

It’s not too bad… it’s still under Deepseek r1 large, but it’s by far the best 32b I’ve ever used.

bitdotben@reddit

I’m really not impressed with its math skills. From o3mini to R1 distills (Q4), they all outperformed QwQ-32B (Q4) in solving a few cubic equations. QwQ was just too chatty and not driven enough. It’s not that it got it wrong, it just never arrived at an answer. I ran it for an hour on a single prompt, where it probably went through its token window many many times (gave it 16k which should be more than enough to reason through finding roots of a cubic equation) and it never stopped.

Evening_Ad6637@reddit

16k is not enough. Even if not using the full context, it’s known from qwq preview that you have to give it at least a 32k context **window** to utilize the full reasoning potential. I don’t know how this behaves with the maximum possible window of 128k, whether some more reasoning capabilities would emerge or what.. but anyway, 16k is definitely not enough.

bitdotben@reddit

Why can you say this so confidently (genuinely interested!)? I’ve seen R1 distills solve this and other math problems in way less than 8k tokens. Why does the maximum token potential from the get go limit its reasoning capabilities? That makes little sense to me.

Evening_Ad6637@reddit

Some users experienced this phenomenon (me as well) and the last time I saw „evidence“ was when this wolf.. wolfram(?) guy (forgot his name) made his benchmarks and he clearly could show and reproduce the case. Qwq got much better results when he extended the context window. To be fair I should mention that it depends on the inference engine and so far I only know of llama.cpp based cases that have this feature/issue/whatever.

bitdotben@reddit

Thanks! That’s super interesting!

pigeon57434@reddit (OP)

are you talking about the new one that came out just a few hours ago because the new one is not very chatty

bitdotben@reddit

Yes talking about the new one from today. It’s extremely chatty in its thinking when doing math. Ask it to solve this equation: „3x^3+2x^2-3x+5=0“
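For reference, this cubic has exactly one real root (its local minimum is still positive, so the other two roots are complex). A minimal bisection sketch to pin down the real root and check a model's answer against it:

```python
def f(x):
    """The cubic from the prompt: 3x^3 + 2x^2 - 3x + 5."""
    return 3 * x**3 + 2 * x**2 - 3 * x + 5


def bisect_root(func, lo, hi, tol=1e-12):
    """Find a root of func in [lo, hi], assuming the endpoint values differ in sign."""
    f_lo = func(lo)
    if f_lo * func(hi) > 0:
        raise ValueError("no sign change on the interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = func(mid)
        if f_lo * f_mid <= 0:
            hi = mid  # root is in the left half
        else:
            lo, f_lo = mid, f_mid  # root is in the right half
    return (lo + hi) / 2


if __name__ == "__main__":
    # f(-2) = -5 and f(-1) = 7, so the real root lies between them.
    root = bisect_root(f, -2.0, -1.0)
    print(root, f(root))
```

The real root comes out at roughly -1.77, so a correct model answer should land there; a closed-form answer via Cardano's formula is possible but messy, which may be part of why models ramble on this one.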

pigeon57434@reddit (OP)

ive been testing full R1, not even a distill, vs full QwQ on a lot harder math than your example today, and i find that QwQ not only gets the right answer but does it more cleanly too, because R1 overcomplicates the hell out of everything whereas QwQ finds more efficient solutions, and i noticed no chattiness inside its cot

Glum-Atmosphere9248@reddit

In my case, R1 does way better email critiques than qwq 32b. At least with a self-converted exl2 4bpw quant. I used the suggested temperature, top-k etc. I may try tomorrow with LM Studio if it supports this model at all. I saw some people saying it doesn't.

BrilliantArmadillo64@reddit

In LM Studio you need to replace the system prompt with one from another Qwen model. There's some syntax error otherwise.

Dogeboja@reddit

You are not supposed to use system prompts at all with this model.

Karyo_Ten@reddit

Have you tried FuseO1-DeepSeekR1-QwQ-SkyT1-32b fused model? How did you find it compared to QwQ and DeepSeek R1?

xor_2@reddit

I think it would be best to wait for (or make myself - bunch of...) fused models like that but with full QwQ. The QwQ preview was fine but it was obviously undercooked, so for fusion and to compare with this new supposedly impressive QwQ it would be best to use it and not its older uneducated (not as much at least) brother. SkyT1 I didn't test at all but apparently it is also a very good model. How would you say the current fuseo1 model you mention fares against this new QwQ?

No-Mountain3817@reddit

so far the FuseO1-DeepSeekR1-QwQ-SkyT1-32b fused model is ruling the local space. QWQ is somehow doing poorly with Q8.

Su1tz@reddit

The full R1 or distilled?

Threatening-Silence-@reddit

It uses a LOT of reasoning tokens. I gave it 32k context for my questions and its coding answers seem pretty good, but need to get my other two 3090s hooked up for 128k context to really make full use I think. There's no way I can fit a codebase in there without it.

ankitm1@reddit

A simpler explanation for how this could be. You take R1 or a model on par with R1. Use its reasoning outputs as training data to create a new reasoning model of similar size. Bring in external high quality data from other sources too, and make sure the RL works. Distill the big model to a small 32B-param one. That would be better than the original model you started with. In reasoning, the feedback loop when it comes to training on synthetic data is positive and self reinforcing (especially when you can automatically check the quality), hence you can pretty much keep on training to get to the best model possible. This is why o3 scores so high and o3-mini outperforms o1.
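The "automatically check the quality" step could look roughly like the filter below: generate reasoning traces, keep only those whose final answer matches a known-correct one, and train on what survives. This is a hypothetical illustration with made-up field names, not a description of how Qwen or DeepSeek actually built their data:

```python
def filter_verified_traces(samples, reference_answers):
    """Keep reasoning traces whose final answer matches a known-correct answer.

    samples: list of dicts with hypothetical keys
             "problem_id", "reasoning" and "final_answer".
    reference_answers: dict mapping problem_id to the correct answer string.
    """
    kept = []
    for sample in samples:
        expected = reference_answers.get(sample["problem_id"])
        if expected is not None and sample["final_answer"].strip() == expected.strip():
            kept.append(sample)
    return kept


if __name__ == "__main__":
    references = {"p1": "4", "p2": "9"}
    generated = [
        {"problem_id": "p1", "reasoning": "2 + 2 ...", "final_answer": "4"},
        {"problem_id": "p1", "reasoning": "2 + 2 ...", "final_answer": "5"},
        {"problem_id": "p2", "reasoning": "3 * 3 ...", "final_answer": " 9 "},
    ]
    print(len(filter_verified_traces(generated, references)))
```

Exact string matching only works for tasks with checkable answers such as math or unit-tested code, which fits the observation elsewhere in the thread that the gains show up there more than in creative writing.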

XForceForbidden@reddit

I don't feel the same, if you compare it to R1-671B. I have a question about nginx whose answer is not clearly defined in their documentation, and the documentation is a little misleading, but the real answer can be found in the nginx source code. The interesting part is that some reasoning models can answer it correctly, like R1 and grok3 thinking, but the corresponding non-reasoning models like v3 or grok3 get it wrong. And o3-mini/o1 are correct, gpt-4o is wrong. Sonnet 3.5/3.7 is the only non-reasoning model that can give the correct answer.

NNN_Throwaway2@reddit

Okay, so share your prompts and sampler settings, then. Bet you won't.

pigeon57434@reddit (OP)

I used it on the huggingface demo not locally so i dont know which one the demo uses

NNN_Throwaway2@reddit

Uh huh

cantgetthistowork@reddit

I'm certain OP is talking about one of the R1 distills and not the real R1

pigeon57434@reddit (OP)

no im talking about R1 full used on the deepseek website and QwQ used on the Qwen website

Emport1@reddit

But Qwen doesn't have QwQ 32B on their website yet.. Don't tell me you used Qwen 2.5 Max for this comparison...

pigeon57434@reddit (OP)

umm what do you call this [https://huggingface.co/spaces/Qwen/QwQ-32B-Demo](https://huggingface.co/spaces/Qwen/QwQ-32B-Demo) im using qwen-32b on that

Emport1@reddit

It's 5 am for me and I'm stupid, mb. And they also do have qwq 32b on their website I don't know what I'm talking about

sammcj@reddit

Are you talking about R1 (671b), or one of the distilled qwen/llama small models?

pigeon57434@reddit (OP)

671B

Expensive-Apricot-25@reddit

The R1 you were using was a distill. Distills are bad approximations of the real thing. If deepseek had trained it from the ground up it would be a different story

pigeon57434@reddit (OP)

no i was using R1 671B on the DeepSeek website and QwQ on the Qwen website im not even talking about locally

SuperChewbacca@reddit

R1 != R1 distill.

pigeon57434@reddit (OP)

im talking about full R1, i find QwQ is better than full 671B R1

BackyardAnarchist@reddit

How's it do on coding compared to r1?

DrVonSinistro@reddit

In my tests, QwQ can 2-shot what R1 1-shots (I'm talking difficult code). It's awesome.

ConnectionDry4268@reddit

What is your difficult code

DrVonSinistro@reddit

- Constructing a Directed Acyclic Graph (DAG) and ensuring no circular dependencies exist is critical.
- Detecting cycles efficiently (e.g., using Kahn’s algorithm or DFS with cycle detection) adds complexity.
- Ensuring that tasks execute in the correct order while allowing parallel execution requires topological sorting.
- Identifying independent tasks that can run concurrently requires graph traversal logic.
- Simulating parallel execution and correctly calculating total execution time requires efficient scheduling.

etc etc
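For anyone who wants to try a similar test, here is a minimal reference sketch of the core pieces listed above: Kahn's algorithm for topological order and cycle detection, plus earliest finish times assuming unlimited parallel workers. It is not the actual test prompt or either model's output, and it assumes every task named in `deps` also appears in `durations`:

```python
from collections import deque


def schedule(durations, deps):
    """Topologically order tasks and compute the parallel makespan.

    durations: {task: time}; deps: {task: [prerequisite tasks]}.
    Returns (order, finish_times, total_time). Raises ValueError on a cycle.
    """
    indegree = {task: 0 for task in durations}
    children = {task: [] for task in durations}
    for task, prereqs in deps.items():
        for prereq in prereqs:
            children[prereq].append(task)
            indegree[task] += 1

    # Kahn's algorithm: repeatedly release tasks with no unmet prerequisites.
    ready = deque(task for task, degree in indegree.items() if degree == 0)
    order, finish = [], {}
    while ready:
        task = ready.popleft()
        order.append(task)
        # A task starts once its slowest prerequisite has finished.
        start = max((finish[p] for p in deps.get(task, [])), default=0)
        finish[task] = start + durations[task]
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(durations):
        raise ValueError("cycle detected: the task graph is not a DAG")
    return order, finish, max(finish.values(), default=0)


if __name__ == "__main__":
    durations = {"fetch": 3, "parse": 2, "build": 4, "ship": 1}
    deps = {"build": ["fetch", "parse"], "ship": ["build"]}
    print(schedule(durations, deps))
```

Tasks that sit in the ready queue at the same time are the ones that can run concurrently; with a limited worker pool the finish-time calculation would need a priority queue instead, which is where a test like this gets harder.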

cosimoiaia@reddit

I did a 'vibe' check with some of my hardest queries. imho it's close, really close to R1, maybe even a touch more than Llama-distilled-70B, but the reasoning of R1 is still superior. Of course we are talking about a 32B against a 671B MoE so it's really impressive, also considering that I can run it on my machine (offloading layers) with "reasonable" t/s generation and I don't have a 'real' GPU. (Ryzen 5 3400G).

llamabott@reddit

I just gave it one of my standard coding tests: Create a 3D spinning cube in python using the pygame library. Was favorably impressed. Did much, much better than Qwen 2.5 Instruct, Qwen 2.5 Coder, and a couple flavors of 32B Fuse R1 Distill had on previous tests on the same prompt (All versions of these models quantized with "IQ4_XS" for my 24 gig card). I then "iterated" along with it for just a few rounds to go from a vanilla, statically rotating wireframe cube to a cube with different, solid colors for each of the cube's faces, to a "psychedelic" version that pulsed in size and changed its spin, etc. I only had to adjust two very specific things for it along the way like reversing the drawing order of the cube's faces. I also agree that its chain-of-thought is reasonable, and the wait does feel worthwhile. I always wanted to like R1 (the full version), but to this day, cannot find a provider for which it is not intolerably slow or unreliable (I'm almost completely over OpenRouter, for instance).

Regrets_397@reddit

I’m pretty impressed with its long reasoning, which provided info in and of itself, and the more detailed output compared to the big online models. For a 32b model, it’s very good.

Cergorach@reddit

But it also hallucinates company names or entire products. So it might be great when it actually uses the right product with the right company... ;)

shyam667@reddit

Now i'm curious about the uncensoredness and safety layers in QwQ? Is it bad like base L3.3 when it comes to story and fiction writing? After all it's just a base model, it definitely won't be that good, but fine-tunes will give it the brain damage afterwards to write better.

M3GaPrincess@reddit

I've not had that experience, but there was a group of researchers that found that. Especially for mathematics, qwq-32b was second only to qwen2-math:72b. I haven't had much luck with the R1 models, so maybe it's my use-case which prevents me from appreciating qwq.

pigeon57434@reddit (OP)

are you sure youre talking about the newest QwQ that came out about 5 ago because its way better than the old version by a significant margin

M3GaPrincess@reddit

I was talking about the one that came out two months ago (the 46407beda5c0 one).

129 Comments