It turns out many providers are not supporting full reasoning high yet. They may need to update chat template. Several independent local test show scores above 65 with the top scores around 68.5. High reasoning will produce 3mil plus completion tokens. Your token count above suggest medium reasoning the default
Also let's not forget that its pretty speedy, quite a lot faster than Qwen3-32b if you run on RAM. In a nutshell (assuming this test is true), you could describe oss‑120b as a fast version of Qwen‑3‑32b but with a touch more punch.
oss-120b only has 5.1b active parameters which means fairly low workload for RAM. Qwen3-32b is dense, meaning it utilizes all the total 32b parameters at once, which is much heavier for RAM.
But nevertheless, it has to load all parameters, right? Active vs non-active parameters are relevant for computing, not for RAM/VRAM. Or am I missing something?
Aha yes, should have mentioned that! I'm actually using it for coding right now and it performs better and faster than most other models I have tried.
I agree it may be worse at other use cases, its high level of censorship being one of the reasons.
Censorship reddit is so obsessed about has nothing to do with model being not good. If you read the paper it is not a general purpose model, it is agentic/stem one, and you need a special way of connecting to the agentic framerwork for it to work well in agentic environments.
I'm happy that we got this model from OpenAI, for me it's another tool in my toolbox, not as a replacement to all other models. There are many other great models for uncensored stuff if you need that.
Same on the Artificial Analysis benchmarks - this is not gpt-5 on local but around the same range as other players [https://www.reddit.com/r/LocalLLaMA/comments/1miqw54/aggregated\_gptoss\_benchmarks/](https://www.reddit.com/r/LocalLLaMA/comments/1miqw54/aggregated_gptoss_benchmarks/)
I wouldn't go off of that. That's just a an aggregate of other benchmarks, so it doesn't really add any new signal. It's fine for looking at models at a glance. But like all the other numbers out there, won't help us if the model is benchmaxxed.
Ok I tried it at openrouter a bit. For writing and knowledge it seems very weak. For coding it is very hit or miss.
Math is very good.
The people also hyped the model based on the shared benchmark results that it is at least o3 at home, which it clearly isn't and certainly not GPT5.
That's not too bad given it's an FP4 model and it's such a sparse MoE. That being said, their safety tuning seriously hurt this model, to the point of making it more unintelligent.
For reference, models like Qwen 3 32B use a similar amount of memory to GPT-OSS and run slower due to being dense models.
The safety tuning is extremely aggressive. It feels like it refuses *everything*, to a ridiculous degree.
I get that openAI was concerned about misuse, and that's fair, but if they hobble the model to such a degree that it isn't competitive, that's a problem too. The Chinese models never refuse unreasonably in my experience.
I may be wrong, but as I understand it the reason for their heavy focus on safety is due to getting sued either for copyright infringement, or like being found responsible for damages in some capacity.
Yes, without a doubt, and totally fair. But why bother to release a non-competitive model? Everybody will go crazy over the great benchmarks day 1, then day 2 "well, actually...".
Their profit from *what*? GPT-5 is getting released tomorrow and will presumably run circles around this thing. That's what everyone was *assuming* would happen.
OpenAI trains model-20b, model-120b, model-400b and model-1500b. The small models (which would've been the fallback models that free customers get relegated to) get released publicly, the large ones stay API-only with a hefty markup. It makes perfect business sense.
It refuses everything naughty. There are other models for that, if you need them.
For my work, though, I don't need naughty. And this model potentially fills my work niche very well indeed. I'm still testing, but it's looking very promising for STEM.
As for the Chinese models, they refuse in other ways. It's just that most people don't roleplay sexy times in recent CCP history :)
(And yeah, also that those models are trivial to jailbreak.)
It also makes it slower if nothing else. When the model spends 3500 out of 4000 tokens rambling
> "Wait. Is this safe? This does not conflict with policies. We can comply. Do we comply? This looks like it could be an issue. Our policies say X. We should double check. Wait. We might comply. We should comply, but cautiously. Yes, we comply. The user wants instructions. We'll comply. We can produce an answer. We should keep it within policy guidelines. The user wants instructions. The policy says we can comply. So we comply. We must ensure we comply with "disallowed content" policy. There's no disallowed content."
...all of that is tokens, time, compute, and reasoning effort which could've been spent on the actual problem.
> You
> Tell me a joke.
>
> GOODY-2
> Jokes often involve unexpected twists or situations that might subtly convey risky behavior or cause emotional distress that could lead to unsafe situations. My ethical principles prioritize absolute safety and prevent engagement in any form of communication that could inadvertently endorse such scenarios.
Maybe they should have went with good architectural choices instead of shooting the model's capabilities by making it extra sparse and low precision?
It runs well on a Macbook 128GB, that's what was gained by this sparsity, but the tradeoff is high.
On my setup, Qwen3 32B runs 3x faster since it's better suited for my hardware - 120B OSS isn't faster across the board on everyone's hardware, it's a tradeoff.
Sparsity is not a major issue, models like Kimi-K2 and Deepseek V3 are just as sparse if not more so. OpenAI's biggest issue was the overhanded censorship that effectively lobotomized the model.
>On my setup, Qwen3 32B runs 3x faster since it's better suited for my hardware - 120B OSS isn't faster across the board on everyone's hardware, it's a tradeoff.
What's your set up out of curiousity?
I run into guardrails on Qwen models too, they are mostly heavily censored by default. Same as Phi series. GPT is also heavily censored but I don't think it kills the model - if it would be genuinely very useful at coding or writing, nobody would mind, and I think we're past the era of safe=dumb, as Claude 4 series has string guardrails too, and those are still clearly very useful models.
My setup is 2x 3090 ti and 64gb ddr4
Both Claude and Gemini seem to be less censored that their older versions. Claude used to refuse to kill processes, now it writes gore without blinking an eye.
I've seen some people say that as well, but I'm confused why we can't just stick 0s on the end of the weights to dequantize and then finetune like normal. Maybe they've found a local minimum that is just really fucking far away from a lower, non-lobotimized minimum
I don't think you could do it that way. If you have a model trained at FP 16, there is like 65K discrete values associated with each weight., But then mixed FP4 there's 16 discrete values (although I think the actual amount of real numbers is slightly smaller for both). There's just enormously greater amount of information for FP 16 to be able to detect the refusal pathways for abliteration.
AFAIK the biggest benefit of increased precision is just the ability to accumulate gradual small gradient updates during training and allow the more major digits to be incremented or decremented.
Stochastic Rounding is one method to emulate this in low precision with a small chance of changing the larger digits based on the direction of the small gradient, so that the more times that occurs the more likely it is to shift, similar to what would happen with accumulation.
For abliteration, sure, but I'm just talking about re-finetuning on "unsafe" data to reduce refusals. Obviously that requires more compute, but it only takes one organization or group to create a "de-safetied" model and put it on HF
I don’t think this works for two reasons. One is that no matter what you call it, alignment, training, construction training, fine-tuning, etc. it’s awesome version of gradient descent to alter the weights. The more you do that, the more you slide towards hallucinations and catastrophic forgetting. Doing more lobotomizing is a hard way to cure lobotomizing. And then in particular, this is FP4, so the coarseness means it would be almost impossible to skillfully fine-tune out the behaviors you want to get rid of. That’s kind of the point of going to such a low precision for training.
This! I can run this model at 50 t/s (with little context, speed drops quite fast) on my Macbook.
Deepseek and Kimi I would struggle to even download, let alone run. Qwen 235B 35B and GLM4.5 Air are definitely competitors in terms of RAM needed, but it feels like a struggle to fit those into my machine and they are kinda sluggish. So from usage perspective this model seems to fit a different box.
So far, I'm actually quite impressed with the speed and how snappy the low reasoning effort mode is. Speaks Slovak significantly better than any open-source model I've recently come across. For someone with 128GB RAM this is quite a solid release. Runs almost as fast as Qwen 3 30B A3B, reasons better and with a lot fewer tokens. I want to test how it codes next, but this result seems actually kinda promising.
And I want the model as an assistant, I don't care much about whether it's censored or refuses to answer things about copyrighted content or do ERP with me. So I do think I'll give it some proper testing and see if it sticks.
That's just plain wrong.
Qwen3 32B uses less than a third of the memory of gpt-oss-120b.
Are you confusing the dense 32B with the 30BA3B moe ?
The A3B is both faster and uses less memory, while the dense 32B would be significantly slower, but also uses way less memory.
At full accuracy, GPT-OSS is in FP4 and benchmarked accordingly. At full accuracy, Qwen 3 32B is in FP16. If you quantize it to Q4, you will not get the benchmarked performance.
Yes, but why would you compare only full accuracy ?
You can quantize any model to make it more memory efficient.
Comparing "full accuracy" to then say the model that's trained at lower precision is superior due to memory usage is just not a useful comparison, when you could trivially optimize the full accuracy version to run at less precision for vastly decreased memory usage if that matters to you.
Qwen 3 32B is about to get an update and will go past it. But the real Qwen comparator is 30B-A3B coder, which gets about 52%
It’s simply not a good coding model. GLM 4.5 Air is significantly better at a similar size.
It's not a good coding model, not a good general information model (heavily censored) and not a good creative model (heavily censored). What is it even good for?
I think for phones due to low power and battery life constraint only MoE should be considered which leaves Qwen3-30B-A3B and GPT-OSS 20B (3.6B experts).
A 30B model at quantization 4 would monopolize all 16GB RAM leaving almost none for context and other app.
For now that's the only niche I see OpenAI's model into.
I think for phones due to low power and battery life constraint only MoE should be considered which leaves Qwen3-30B-A3B and GPT-OSS 20B (3.6B experts).
A 30B model at quantization 4 would monopolize all 16GB RAM leaving almost none for context and other app.
For now that's the only niche I see OpenAI's model into.
This one works quite well [https://www.reddit.com/r/ChatGPTJailbreak/comments/1mjbn80/gptoss\_jailbreak/](https://www.reddit.com/r/ChatGPTJailbreak/comments/1mjbn80/gptoss_jailbreak/)
Alibaba has all Queen models on their api now. I would look to see their future OS checkpoints to be inferior to cloud checkpoints. Interactive advertisements.
It has 5 active parameters, atleast normal people with decent system ram can run it at any acceptable speed. I’m getting 5t/s on dual channel DDR4 3200. I can’t run Kimi or R1 at all
It seems a truly scare amount of people are mainly interested in getting revenge from not having had online dating success - so they're looking to finally have someone ask them about their 'throbbing third leg'... Yesterdays posts were all about how it was censoring and not engaging in writing erotica.
For people not looking for that they do seem interesting - they're fast, and seem to take well to instructions.
You've gotta understand that a solid 50% of this sub just uses their models for smut. Once you understand that all of the discourse makes much more sense
5B active parameters vs 12B. It's not always a linear scaling, since compute needed sometimes play a role too, but in some scenarios, gpt oss 120b would be almost 2.5x faster than glm 4.5 air.
True, some form of speculative decoding could be added onto GPT OSS 120B too though.
We could be ping-ponging features for a few messages like that.
GPT is a lower quant by default, less actual memory use is needed
But GLM has usable exl3 3.07 SOTA quants prepared by turboderp himself, manually tuned for maximum performance.
But you might be able to run GPT with W4A8 scheme or maybe even W4A4, exl3 is WxA16.
But gpt is mxfp4 and it won't quant to any other size easily
Depending on exact place, gpt or GLM will run better. On my setup, GLM 4.5 air 3.07bpw is around 3x faster than gpt 120b gguf mxfp4, just because I can't put the whole gpt in vram. But when I use GLM 4.5 air q4 gguf, it's about the same speed as gpt I think. 2x 3090 ti and 64gb of Ddr4 ram
I think the main point here is that we're debating whether gpt-oss is 5% better or 5% worse than comparable Chinese models which came out a month ago. This thing was supposed to beat R1 at 1/6th the parameters in order to blow people away. If it's on par with Qwen/GLM, that's a failure.
The whole narrative here is that OpenAI are the OGs, the #1, the king of models. When THEY make something, they do it properly. This is a proper, red-blooded American model that does it right, not that second rate Chinese knockoff crap that tries to imitate it. The only reason the Chinese models are any good is because they train on OpenAI's output and copy all of the innovations THEY came up with.
...Well, if you hype something up for half a year, and then people end up debating whether it is or isn't worse than the Chinese knockoff crap from a month ago, that's not a good look.
Kimi, deepseek and qwen3 are in another category. Those models need a GPU and a fast one, they don't even run well on macs.
GPT-oss can run on a Intel CPU. It's like a big version of Qwen-30B, not a competitor to Deepseek.
Yeah I ran it on [FamilyBench](https://github.com/Orolol/familyBench), my own reasonning benchmark that you can't really benchmax because it can be regeneratedn each time, the 120b score below GLM 4.5 air and the 20b, below Hunyuan A13b.
This is great (though the danger is that if they cared, model creators can train their model on your problem with random seeds and gain performance relatively easily).
I like how you've done this though. I firmly believe that benchmark creators should generate 25-50% more questions and release \~5% of the questions every 6 months. Will significantly help detect benchmark gaming.
> This is great (though the danger is that if they cared, model creators can train their model on your problem with random seeds and gain performance relatively easily).
Of course, but the point is that it's quite immune to direct data contamination. If they train on it and their models become more performant because of it, great ! If they're just benchmaxxing, I'm working on more benchmarks anyway.
>If they train on it and their models become more performant because of it, great !
More performant **on this specific task**. The whole idea of benchmaxing is that you overtrain (and thus overfit) on tasks that are part of benchmarks.
> More performant **on this specific task**.
Yeah of course.
> The whole idea of benchmaxing is that you overtrain (and thus overfit) on tasks that are part of benchmarks.
But with fixed question benchmark, it's quite easy to have data spilling, but overtraining a model to answer MMLU for example, even with rewards and without giving the answer directly, the model won't be good answering questions, it will be good answering those questions.
With randomly generated questions, you force the model to generalize in this area of skill. For example in my benchmark, a big chunk of complexity come from retrieving information in a large context. In the current seed, there's 400 different people described in a 20k token context. When I ask a model to give all the cousins of the father of the sister of X, I make the model looking for many needles in a large haystack.
Sure, after overtraining on this, models will be better on this specific benchmark, but it would still benefits far more for the global performance of the model rather than a fixed set of questions where the model just have to guess and memorize answer.
You're arguing about why randomized questions are better than fixed questions, but I never questioned that claim.
I specifically questioned the way you're presenting it here as if randomized questions (which still follow a specific pattern) meant that you "can't really benchmax", and that "training on \[them\]" would necessarily make the models "more performant \[in general\]".
> There's a massive difference between mitigating and solving a problem, and you're acting as if randomized questions in a benchmark solve these problems, when in reality, they mitigate them to a certain degree, but you absolutely still can benchmax on a benchmark with randomized questions.
Ok I think we can agree on this.
Yeah that makes sense. Benchmax is definitely happening. Contrary to popular belief, they don't have to train on the data from the tests to benchmax. Just selecting the model to release based upon how it performs on a small set of popular benches can implicitly overfit the model via selection. Then you you'll see regressions in other areas that were not tested for.
The same can happen with this benchmark as well. Nowadays, these models are so capable that you're often not overfitting to the individual samples in the benchmar, but to the specific type of task.
>you can't really benchmax because it can be regeneratedn each time
You absolutely can. Benchmaxing doesn't necessarily mean overfitting to individual samples, it can also mean overfitting to specific sample classes (such as types of tasks). In that case, the scores will be representative for that model on your specific type of benchmark tasks (reasoning about family trees), but that may not generalize to any other tasks that would be considered just as "difficult" or require "similar reasoning" so to say.
Your benchmark's main benefit, as of now, is that it hasn't blown up and is likely not on the radar of these companies (although that's not for certain either).
You've probably seen it but other folk here may not have: Apple have released relevant research about the same thing:
* https://machinelearning.apple.com/research/illusion-of-thinking
* https://arxiv.org/pdf/2410.05229
The whole tree is entirely new each time, not only the question. Sure, training would improve performance, but this is literally how LLMs works, they get better when training.
We are talking about an infinite set of family tree problems, no? So by training on this set, it learns how to solve family tree problems in general, not just the ones it saw. But that doesn't mean that it's good at other things. Consider the extreme case, where you train an LLM only on your benchmark, nothing else. It will get quite good at it, but will fail all other benchmarks and have no real world utility. In other words, it benchmaxed your benchmark.
> Training on a set of benchmark problems (even if that set is nearly infinite) is still benchmaxing.
Non, benchmaxxing is training to benchmark to a point that your model can't generalize and is far less potent for users than for benchmark.
If you train a model to be good on benchmark, but your model can still generalize and have better performance after this training, then there's no problem.
This is why randomly generated benchmark are great, they test the ability for a model to generalize on a specific area rather than brute learning solutions.
Deep learning models generalise to some extent, they don't just memorise the training set. In this case it will learn to reason about family tree problems. Through training it builds an approximate algorithm to solve such problems.
Can I just say FamilyBench is really clever! Have you considered using it to really stress test long context lengths (200K+)? Ideally you’d intermix statements about these people but not family tree oriented to extend the text (and stress test attention)
Thanks ! I'll do more tests with long context, more thinking tokens, etc, but this is quite expensive haha.
First I need to test Opus and o3 to see how sota models perform.
Do you send the context with each question in your bench or do you chain questions in multi-turn? I'm happy to run some benchmarks also and contribute (esp on opensource models that support long context). Been meaning to really stress test quantization and cache quantization and this is a very good benchmark for it.
There's a thinking version of Qwen 3 30B A3B, it's worth adding that to your benchmark to get a clearer picture. GPT-OSS 20B's score on your benchmark is actually pretty good all considered. Also, is Qwen 3.2 Thinking QwQ? And what size is the model listed as Qwen 3.2?
Its really not good in comparison.
Weird to see all the answers on Samas X post about the models. People are speaking of the new best model, huge milestone etc. Wonder whats going on in their heads, don't they test models? Or do they just not realize?
Like what is this?! [https://x.com/measure\_plan/status/1952796264359407796](https://x.com/measure_plan/status/1952796264359407796)
>GPT-OSS looks more like a publicity stunt as more independent test results come out :(
Do you have any doubts? What were you expecting? Another Deepseek-R1 or Qwen\_QwQ-32B moment? That's not gonna happen from the American labs anymore.
Rather hard to compare anything here. When a 120B model has like 5B active parameters, I am tempted to rather compare it to other 5B models than to other 120B models.
> I saw everyone get excited about it
Who? Most people here were very skeptical about this PR stunt from the beginning, even before the "AI safety" comment. Remember the Twitter poll where he was trying to release a small language model that runs on a smartphone?
If anyone was having a high expectation, it's their fault.
I was excited. It looked promising and there was hype around it. I poked at it as Horizon Alpha and it looked amazing at first. Now that I've played with it, I've been nothing but disappointed and believe it's a waste of disk space compared to GLM/Kimi. America is losing it's edge in tech, it's actually crazy to watch it happen.
Horizon seems to write so well at first, until you look closer at the sentences. It makes so many small logic errors, reminds me of early Gemma. Maybe the thinking version will be more reasonable, hope it is not gpt-5.
I mean people looking at the benchmarks before using it are talking about it like it is a game changer. Youtube etc. I have found the benchmarks to be pretty pointless now... drop it into a coder or your own use case and see what happens. for me gemini-2.0-flash and gpt-4o or 4.1 win for conversational / lower latency chat
Yeah, it ends the convo instantly in open hands. R1-0528 ends convos too though. I think Open Hands just has trouble with reasoning models, unfortunately. They really need to fix that.
OpenAI's blog post does state that its training data is "mostly English". That's one potential explanation for why it fails a polyglot benchmark. Though granted, a mostly English (or mostly English and Chinese) dataset is the standard for a majority of LLMs.
Llama3 had about 8% multilingual data, for example.
The model defaults to low effort. I ran gpt-oss on aider polyglot with "Reasoning: high\\n" prepended to the system message and got 59.1% for the 120b and 28.9% for the 20b.
I think we need to manage expectations and see the real use case.
Unless you built an AI rig. This is probably the best model you can run on your computer. It runs fine on CPU. ( Cerebras is serving it at something like 3k tps. )
It's very sensible and allows for integration into software consumers can actually use.
Agreed on managing expectations. I don't think GPT OSS was intended for use cases outside of English.
Sam Altman / OpenAI clearly said it was trained on mostly English-only text. Well duh, OF COURSE it'll score poorly on a POLYGLOT benchmark.
To be honest that is not of my greatest disappointments when it comes to GPT-OSS, I had hoped this would become one of the best, if not absolute best multilingual OSS models. As OpenAI clearly has access to a waste amount of multilingual data, and their bigger models are some of the best at a wide variety of languages.
Training it mostly on English only feels like a really odd decision. Especially given most other popular models of that size is at least bilingual these days.
Polyglot in this context refers to multiple different _programming languages._
It's right there when you Google it, and can be easily inferred by the context of it being a coding benchmark. The post is saying that it is worse than other models at programming tasks, the benchmark is in English.
I'm just playing devil's advocate here. But, the way they approached this, and the "safety" etc. Will allow large corporations to adopt local models where previously there would be too much liability.
IE they aren't going to run Qwen A3B in mail trucks.
Qwen3-coder, GLM4.5-air, and Kimi K2 all honestly embarrass GPT-OSS, IMO.
It isn't a *bad* model, but the recent Chinese ones are simply superior.
Only real advantage of GPT-OSS is the 20B version will run on consumer GPUs with 16GB VRAM.
I don't understand why everyone likes GLM4.5-air so much.
It has the same size as GPT-OSS only in iq4_xs vs q8 GPT-OSS (unsloth).
It has a lower token generation speed: 20 t/s vs 30 t/s (5090 + 64gb + 14700k).
It shows worse in my own tests (but to be fair GPT-OSS sometimes generates really weird results).
So I don't get it at all.
It's all about the 120b version.
The 20b version is complete garbage, it is so strong in math by benchmarks, but in reality it pretty constantly thinks that 15.11 > 15.9 for example.
The real king here is qwen 3 30b thinking 2507.
50k context, 120-150 t/s in q6 unsloth, not that censored and faster loading. It's soo good. Only in math problems it is rarely worse than 120b, but the pros outweigh this con.
GLM 4.5 Air starts off great but shits itself pretty bad up into the halfway mark of context. It's overly aggressive with tool calls. The GPT-OSS model needs time for the ecosystem to catch up, some fixes to chat templates, etc. What I've noticed with GPT-OSS is that while not as flashy or fancy as the chinese models on one shot games/apps, they seem to be more functionally sound with less prompting. Time will tell.
This is not a fair comparison, Komi k2 is 1 TRILLION parameters… deepseek is 671b, and qwen3 32b is a dense model, where as the gpt-oss is a very sparse 5b active moe model.
GPT-OSS is very strong in my tests.
Note that bugs in inference engines and chat templates can greatly lower the perceived performance of the LLM, so I would give it some time.
Yeah I often notice that when new models come out with a vastly different way of prompting it, or an unusual tokenizer or anything else like that it often gets shat on during the first week or so before the pain points are ironed out and people release it's actually a pretty decent model.
I know Gemma 2 certainly went through some growing pains like that. GPT-OSS's tokenizer is quite standard but it has a very unusual prompting template and way to output content. That's why OpenAI release [Harmony](https://github.com/openai/harmony) as a reference project. It's clear most programs aren't really setup to handle it ideally yet.
I am also wondering if people are 'running it wrong'. I was very impressed. Very fast, very strong. Delighted to be living in the future. In 2020 a 12gb GPU could generate maybe a line or two of 'continuation' text. Now this stuff. incredible.
Also, personal benchmarks are biased and people assume the model is bad when it fails to one shot example programs.
My only criticism of GPT-OSS is that it seems to forget things very easily. I lost a lot of detail when I asked it to summarize a conversation of 26k tokens, while other models did much better (though this too may be a bug in the inference method I'm using, we'll see).
It is a publicity stunt, they need to ensure people forget about the news that OpenAI development team were using Claude to develop GPT5, so they lost access to Claude
So when GPT5 will not deliver what they promised OpenAI will use GPT-OSS as a comparison between them
OpenAI just lowered the bar
I think this is impressive! So I can get Qwen3 32B performance, which is my favorite model family for English, with just 5.1B active parameters and blazing fast inference?
Inference implementation differences can vastly vary perceived model quality. Bugs in the implementation might produce something that looks correct but is "dumber" overall.
This model wastes so many tokens and computation on censorship.. it's insane!
Yesterday I did around 30 messages with the model and I kid you not, almost 30% of the thinking tokens were about censorship.
What A HUGE waste of electricity and computational resources to be overthinking so much on censorship. Even a simple ask "choose between these two football clubs" and its censorship about how it cannot side with debates creeps up and wastes thinking tokens.
Straight to the 🗑️
Probably oss120b « gift » is a campaign to clean their closed identity to the IA open source dev community. And openAI was really well supported by LMStudio and Ollama etc with this campaign. Much more than open-source (or open weights ?) GLM4.5 Air which is probably much better for coding and can be run with less specs. Strange behavior !
This model is straight garbage. Immediately on the first test I did it failed catastrophically. Take a look at this
https://i.imgur.com/98Htx6w.png
I referenced a full code file, asked it to implement a simple feature but I made a mistake and specified LoggerExt instead of EnhancedLogger. (I forgot the real name of class) But there was no ambiguity, only class in context and VERY clearly what was meant based on the context I provided.
So I rectify that, update with the right class, and what does it do next? Starts using search tools and wasting tokens. The class is in the context. Kilo did nothing wrong, I retried with Horizon Beta, same exact prompt.
Immediately understood what I meant, immediately gets to work writing code.
I am curious though what reasoning effort they are using. I am not sure how I can set the reasoning effort when using llama.cpp, since its defined in the chat template and if its not specified it defaults to medium. I've heard that the model behaves pretty well on high reasoning effort only.
In the chat template, in the system prompt building macro, you can find
`{%- if reasoning_effort is not defined %}`
`{%- set reasoning_effort = "medium" %}`
`{%- endif %}`
`{{- "Reasoning: " + reasoning_effort + "`
that's where my confusion comes from. Is the reasoning\_effort kwarg taken from the "user-provided" system prompt, or is this building macro not used if you use a custom system prompt?
In the chat template, in the system prompt building macro, you can find
||
||
|{%- if reasoning\_effort is not defined %}|
||
||
|{%- set reasoning\_effort = "medium" %}|
||
||
|{%- endif %}|
|||
This model really feels like a troll job - create all kinds of hype around it and then release a model that shows just enough of what might be possible in terms of speed but make it unusable for any reason someone would want to use a local model.
It wouldn't surprise me if they turn around and use this failure as a ploy to lobby for more government resources to "compete" with Chinese models when the real problem was they just dropped a deuce on us all.
My take: This model is closer to o3 mini than o4 mini (it has less knowledge overall, is more censored, and has no multimodality).
o4 mini is also not good for web dev, especially if you need an aesthetically good-looking website. Also, keep in mind this model is comparable to a ~25B dense model (sqrt(120*5.1) = 24.78B), but we shouldn't forget only 5.1B of that is active.
But it's very, very efficient + thinks lesser than other open models. You can run it easily with just a CPU and DDR5 RAM.
Another thing I've noticed is that the Firework versions perform much better than the Groq ones.
This makes me more grateful to the Qwen team, though. It's like when you're given something, you don't value it that much. I don't use o4 mini often, but I used it today to compare with these OSS models, and I think Qwen-3-30B-A3B performs comparably to o4 mini.
It is a very strange model, I tested some knowledge question and even the 120B model is very limit in certain aspects.
Someone on Twitter said it was only trained on syntactic data, which might explain some of it.
It performs mathematical calculations and certain types of coding very well.
However, the initial hype that it is basically an O3 at home seems to be not true at all.
Imho overhyped at day one but not bad for the right use case.
Why would anyone make a twitter post using a *single* benchmark score and extrapolate it to the overall usefulness of the whole model?
Plus, if DeepSeek-R1 used in this comparison is the 671b unquantized version, that's in an entirely different league and it'd be a miracle if it *didn't* blow away the 120b MoE that runs on consumer-grade hardware.
OpenAI said GPT-OSS is the worlds best open source model claiming sota performance on benchmarks. but it perfomed worse on independent benchmarks like simplebench, Aider Polyglot or [Artificial analysis](https://x.com/ArtificialAnlys/status/1952887733803991070)
and i never claimed GPT-OSS is a bad model, it is def a top 5 open weight model
Aider was in the model card they released as being 24% for low, 34% for medium, and 44% for high.
Given on other model's like gemini 2.5 pro I've seen it get between like 78% and 86% a 2% difference seems quite reasonable. So I don't really see this independent test as disagreeing with the results they released at all.
None of the listed "for comparison" models actually compare in terms of size or active parameters, though. GLM-4.5 air, or maybe Qwen3-235B, quantized to 2-bit, would be the most fair (though they have more active)...
really? no way!
openAI claimed GPT-OSS is the best open source model while performing worse on indipendent benchmarks, that is what my post criticising
and yes it is a good model def top 5 rn
228 Comments
Desperate-Cry592@reddit
JC1DA@reddit
Few-Yam9901@reddit
Sorry_Ad191@reddit
AppearanceHeavy6724@reddit
Admirable-Star7088@reddit
Apprehensive_Win662@reddit
Admirable-Star7088@reddit
Apprehensive_Win662@reddit
Mark_Collins@reddit
AppearanceHeavy6724@reddit
Admirable-Star7088@reddit
AppearanceHeavy6724@reddit
Admirable-Star7088@reddit
AppearanceHeavy6724@reddit
mearyu_@reddit
drooolingidiot@reddit
Utoko@reddit
Few_Painter_5588@reddit
jakegh@reddit
gronahunden@reddit
jakegh@reddit
raiffuvar@reddit
lizerome@reddit
raiffuvar@reddit
jakegh@reddit
raiffuvar@reddit
llmentry@reddit
jakegh@reddit
lizerome@reddit
ortegaalfredo@reddit
jakegh@reddit
Equivalent-Bet-8771@reddit
FullOf_Bad_Ideas@reddit
Few_Painter_5588@reddit
FullOf_Bad_Ideas@reddit
Thomas-Lore@reddit
jakegh@reddit
junior600@reddit
Mbando@reddit
throwaway2676@reddit
Mbando@reddit
AnOnlineHandle@reddit
throwaway2676@reddit
Mbando@reddit
lakySK@reddit
rusty_fans@reddit
Few_Painter_5588@reddit
PurpleUpbeat2820@reddit
rusty_fans@reddit
Aldarund@reddit
fdg_avid@reddit
TheInfiniteUniverse_@reddit
OkraFirm@reddit
RawbGun@reddit
uhuge@reddit
Karyo_Ten@reddit
InsideYork@reddit
Karyo_Ten@reddit
cargocultist94@reddit
Karyo_Ten@reddit
InsideYork@reddit
RawbGun@reddit
Karyo_Ten@reddit
Neither-Phone-7264@reddit
ortegaalfredo@reddit
xyzzs@reddit
ortegaalfredo@reddit
xyzzs@reddit
Particular-Way7271@reddit
AngryBear1990@reddit
Lorian0x7@reddit
Sudden-Lingonberry-8@reddit
GhettoClapper@reddit
UnionCounty22@reddit
boringcynicism@reddit
Dundell@reddit
boringcynicism@reddit
101m4n@reddit
OkraFirm@reddit
boringcynicism@reddit
SocialDinamo@reddit
i-eat-kittens@reddit
SamSlate@reddit
RLA_Dev@reddit
SamSlate@reddit
Tman1677@reddit
Different_Fix_2217@reddit
FullOf_Bad_Ideas@reddit
Thick-Specialist-495@reddit
FullOf_Bad_Ideas@reddit
lizerome@reddit
Thomas-Lore@reddit
ortegaalfredo@reddit
ANTIVNTIANTI@reddit
ortegaalfredo@reddit
relmny@reddit
Aggressive-Physics17@reddit
Gamplato@reddit
Upeksa@reddit
Gamplato@reddit
ThenExtension9196@reddit
MrPecunius@reddit
Gorgoroth117@reddit
trajo123@reddit
Orolol@reddit
LoSboccacc@reddit
Specialist-Wheel5867@reddit
LoSboccacc@reddit
vibjelo@reddit
Orolol@reddit
Leopold_Boom@reddit
Orolol@reddit
HiddenoO@reddit
Orolol@reddit
HiddenoO@reddit
Orolol@reddit
BrainOnLoan@reddit
LoSboccacc@reddit
joe0185@reddit
HiddenoO@reddit
HiddenoO@reddit
SuperFail5187@reddit
EmberElement@reddit
Orolol@reddit
alphabetaglamma@reddit
Orolol@reddit
trajo123@reddit
Orolol@reddit
trajo123@reddit
bbsss@reddit
trajo123@reddit
Orolol@reddit
InsideYork@reddit
trajo123@reddit
gofiend@reddit
Orolol@reddit
Leopold_Boom@reddit
EstarriolOfTheEast@reddit
lemon07r@reddit
jnk_str@reddit
Iory1998@reddit
Thomas-Lore@reddit
Monkey_1505@reddit
Thomas-Lore@reddit
cobbleplox@reddit
Thomas-Lore@reddit
eldercito@reddit
pitchblackfriday@reddit
k4ch0w@reddit
Thomas-Lore@reddit
eldercito@reddit
Maleficent_Age1577@reddit
createthiscom@reddit
FrostAutomaton@reddit
CarobFull3130@reddit
RMCPhoto@reddit
sourpatchgrownadults@reddit
mikael110@reddit
SamSlate@reddit
ivxk@reddit
sourpatchgrownadults@reddit
thebadslime@reddit
RMCPhoto@reddit
jakegh@reddit
Sea_Fox_9920@reddit
Informal-Spinach-345@reddit
jakegh@reddit
Expensive-Apricot-25@reddit
createthiscom@reddit
Expensive-Apricot-25@reddit
Far_Buyer_7281@reddit
tarruda@reddit
Affectionate_Relief6@reddit
tarruda@reddit
mikael110@reddit
inteblio@reddit
tarruda@reddit
Fun-Wolf-2007@reddit
DrummerPrevious@reddit
evilbarron2@reddit
Ok-Telephone7490@reddit
caledh@reddit
entsnack@reddit
idkwhattochoo@reddit
entsnack@reddit
SixZer0@reddit
eli_pizza@reddit
Bangaladore@reddit
Whole-Assignment6240@reddit
mgr2019x@reddit
Spirited_Example_341@reddit
No_Contact_9561@reddit
Dantescape@reddit
damiangorlami@reddit
TheRealGentlefox@reddit
BoJackHorseMan53@reddit
lemon07r@reddit
Low88M@reddit
Michael0308@reddit
ryunuck@reddit
popecostea@reddit
chibop1@reddit
popecostea@reddit
popecostea@reddit
popecostea@reddit
BillyWillyNillyTimmy@reddit
popecostea@reddit
marcoc2@reddit
skrshawk@reddit
SmartEntertainer6229@reddit
Sadman782@reddit
Utoko@reddit
gigaflops_@reddit
boringcynicism@reddit
mvp525@reddit (OP)
loyalekoinu88@reddit
lordchickenburger@reddit
boringcynicism@reddit
Different_Fix_2217@reddit
boringcynicism@reddit
EngStudTA@reddit
boringcynicism@reddit
lily_34@reddit
boringcynicism@reddit
Leflakk@reddit
mvp525@reddit (OP)
Leflakk@reddit