
GPT-OSS looks more like a publicity stunt as more independent test results come out :(

Posted by mvp525@reddit | LocalLLaMA | View on Reddit | 228 comments


Desperate-Cry592@reddit

They're paving the road for a for-profit conversion... trying to build some goodwill before that. They'd never ever release anything useful otherwise.
View on Reddit #64069529

JC1DA@reddit

Below is my test with the diff-fenced format:

```
- dirname: 2025-08-06-16-31-06--gpt-oss-120b
  test_cases: 225
  model: openai/gpt-oss-120b
  edit_format: diff-fenced
  commit_hash: f38200c-dirty
  pass_rate_1: 14.7
  pass_rate_2: 46.7
  pass_num_1: 33
  pass_num_2: 105
  percent_cases_well_formed: 77.8
  error_outputs: 84
  num_malformed_responses: 84
  num_with_malformed_responses: 50
  user_asks: 147
  lazy_comments: 1
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 4255451
  completion_tokens: 865457
  test_timeouts: 0
  total_tests: 225
  command: aider --model openai/gpt-oss-120b
  date: 2025-08-06
  versions: 0.85.3.dev
  seconds_per_case: 76.8
  total_cost: 0.0000
```
View on Reddit #63546076

Few-Yam9901@reddit

It turns out many providers don't support full high reasoning yet. They may need to update their chat template. Several independent local tests show scores above 65, with the top scores around 68.5. High reasoning will produce 3 million plus completion tokens. Your token count above suggests medium reasoning, the default.
View on Reddit #64058581

Sorry_Ad191@reddit

gpt-oss-120b with reasoning set to high got 68.4%; check the Aider Discord.
View on Reddit #63963105

AppearanceHeavy6724@reddit

41.8 is not bad. Not stellar, but not bad.
View on Reddit #63519255

Admirable-Star7088@reddit

Also let's not forget that it's pretty speedy, quite a lot faster than Qwen3-32b if you run on RAM. In a nutshell (assuming this test is true), you could describe oss-120b as a fast version of Qwen3-32b but with a touch more punch.
View on Reddit #63520243

Apprehensive_Win662@reddit

Why is it faster if you run on RAM?
View on Reddit #63521433

Admirable-Star7088@reddit

oss-120b only has 5.1b active parameters, which means a fairly low workload for RAM: only those weights have to be read for each token. Qwen3-32b is dense, meaning it uses all 32b parameters for every token, which is much heavier on RAM.
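A rough back-of-the-envelope sketch of that point, not a measurement: every weight has to fit in memory, but only the active ones are read per token, so the decode ceiling is roughly memory bandwidth divided by active bytes. The 4 bits per weight, the nominal 120B total, and the 51.2 GB/s figure (theoretical peak of dual-channel DDR4-3200) are assumptions, the function names are made up for illustration, and KV cache, attention and other overhead are ignored.

```python
def weights_gb(params_billion, bits_per_weight):
    """Size of the given number of parameters in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def decode_ceiling_tps(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s if every active weight is read once per token."""
    return bandwidth_gb_s / weights_gb(active_params_billion, bits_per_weight)

BANDWIDTH = 51.2  # GB/s, assumed: theoretical peak of dual-channel DDR4-3200

# (total params, active params) in billions; 4 bits per weight assumed for both
models = {"gpt-oss-120b (MoE)": (120, 5.1), "Qwen3-32B (dense)": (32, 32)}

for name, (total, active) in models.items():
    print(f"{name}: holds {weights_gb(total, 4):.1f} GB, "
          f"reads {weights_gb(active, 4):.2f} GB/token, "
          f"ceiling {decode_ceiling_tps(active, 4, BANDWIDTH):.1f} tok/s")
```

So the MoE needs far more memory capacity but much less memory traffic per token, which is what matters for speed on system RAM.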
View on Reddit #63521902

Apprehensive_Win662@reddit

But nevertheless, it has to load all parameters, right? Active vs non-active parameters are relevant for computing, not for RAM/VRAM. Or am I missing something?
View on Reddit #63859797

Mark_Collins@reddit

What are your PC specs?
View on Reddit #63525457

AppearanceHeavy6724@reddit

Perhaps only for coding purposes. As a generalist it is worse than Qwen.
View on Reddit #63520703

Admirable-Star7088@reddit

Aha yes, should have mentioned that! I'm actually using it for coding right now and it performs better and faster than most other models I have tried. I agree it may be worse at other use cases, its high level of censorship being one of the reasons.
View on Reddit #63520869

AppearanceHeavy6724@reddit

The censorship Reddit is so obsessed with has nothing to do with the model not being good. If you read the paper, it is not a general-purpose model; it is an agentic/STEM one, and you need a special way of connecting it to the agentic framework for it to work well in agentic environments.
View on Reddit #63521569

Admirable-Star7088@reddit

I'm happy that we got this model from OpenAI. For me it's another tool in my toolbox, not a replacement for all other models. There are many other great models for uncensored stuff if you need that.
View on Reddit #63523115

AppearanceHeavy6724@reddit

I agree.
View on Reddit #63523404

mearyu_@reddit

Same on the Artificial Analysis benchmarks - this is not gpt-5 on local but around the same range as other players: https://www.reddit.com/r/LocalLLaMA/comments/1miqw54/aggregated_gptoss_benchmarks/
View on Reddit #63519647

drooolingidiot@reddit

I wouldn't go off of that. That's just an aggregate of other benchmarks, so it doesn't really add any new signal. It's fine for looking at models at a glance. But like all the other numbers out there, it won't help us if the model is benchmaxxed.
View on Reddit #63523246

Utoko@reddit

OK, I tried it on OpenRouter a bit. For writing and knowledge it seems very weak. For coding it is very hit or miss. Math is very good. People also hyped the model, based on the shared benchmark results, as being at least o3 at home, which it clearly isn't, and it's certainly not GPT-5.
View on Reddit #63528253

Few_Painter_5588@reddit

That's not too bad given it's an FP4 model and it's such a sparse MoE. That being said, their safety tuning seriously hurt this model, to the point of making it more unintelligent. For reference, models like Qwen 3 32B use a similar amount of memory to GPT-OSS and run slower due to being dense models.
View on Reddit #63519928

jakegh@reddit

The safety tuning is extremely aggressive. It feels like it refuses *everything*, to a ridiculous degree. I get that OpenAI was concerned about misuse, and that's fair, but if they hobble the model to such a degree that it isn't competitive, that's a problem too. The Chinese models never refuse unreasonably in my experience.
View on Reddit #63531095

gronahunden@reddit

I may be wrong, but as I understand it, the reason for their heavy focus on safety is the risk of getting sued, either for copyright infringement or for being found responsible for damages in some capacity.
View on Reddit #63540637

jakegh@reddit

Yes, without a doubt, and totally fair. But why bother to release a non-competitive model? Everybody will go crazy over the great benchmarks day 1, then day 2 "well, actually...".
View on Reddit #63546730

raiffuvar@reddit

Why would they release a competitive model? To cut their profit?
View on Reddit #63557462

lizerome@reddit

Their profit from *what*? GPT-5 is getting released tomorrow and will presumably run circles around this thing. That's what everyone was *assuming* would happen. OpenAI trains model-20b, model-120b, model-400b and model-1500b. The small models (which would've been the fallback models that free customers get relegated to) get released publicly, the large ones stay API-only with a hefty markup. It makes perfect business sense.
View on Reddit #63567181

raiffuvar@reddit

It did not age well
View on Reddit #63787572

jakegh@reddit

Why release anything at all if it sucks?
View on Reddit #63559744

raiffuvar@reddit

Ask sam
View on Reddit #63787550

llmentry@reddit

It refuses everything naughty.  There are other models for that, if you need them. For my work, though, I don't need naughty.   And this model potentially fills my work niche very well indeed.  I'm still testing, but it's looking very promising for STEM. As for the Chinese models, they refuse in other ways.  It's just that most people don't roleplay sexy times in recent CCP history :) (And yeah, also that those models are trivial to jailbreak.)
View on Reddit #63534751

jakegh@reddit

I don't need "naughty" either. It refused SQL analytics for me yesterday.
View on Reddit #63535054

lizerome@reddit

It also makes it slower, if nothing else. When the model spends 3500 out of 4000 tokens rambling

> "Wait. Is this safe? This does not conflict with policies. We can comply. Do we comply? This looks like it could be an issue. Our policies say X. We should double check. Wait. We might comply. We should comply, but cautiously. Yes, we comply. The user wants instructions. We'll comply. We can produce an answer. We should keep it within policy guidelines. The user wants instructions. The policy says we can comply. So we comply. We must ensure we comply with "disallowed content" policy. There's no disallowed content."

...all of that is tokens, time, compute, and reasoning effort which could've been spent on the actual problem.
View on Reddit #63567456

ortegaalfredo@reddit

It's super easy to jailbreak though, unlike GLM-air.
View on Reddit #63559056

jakegh@reddit

Hah, is it? Have to catch up on my man Pliny.
View on Reddit #63559786

Equivalent-Bet-8771@reddit

> You
> Tell me a joke.
>
> GOODY-2
> Jokes often involve unexpected twists or situations that might subtly convey risky behavior or cause emotional distress that could lead to unsafe situations. My ethical principles prioritize absolute safety and prevent engagement in any form of communication that could inadvertently endorse such scenarios.
View on Reddit #63531677

FullOf_Bad_Ideas@reddit

Maybe they should have gone with good architectural choices instead of shooting the model's capabilities by making it extra sparse and low precision? It runs well on a Macbook 128GB, that's what was gained by this sparsity, but the tradeoff is high. On my setup, Qwen3 32B runs 3x faster since it's better suited for my hardware - 120B OSS isn't faster across the board on everyone's hardware, it's a tradeoff.
View on Reddit #63525756

Few_Painter_5588@reddit

Sparsity is not a major issue; models like Kimi-K2 and Deepseek V3 are just as sparse, if not more so. OpenAI's biggest issue was the heavy-handed censorship that effectively lobotomized the model.

> On my setup, Qwen3 32B runs 3x faster since it's better suited for my hardware - 120B OSS isn't faster across the board on everyone's hardware, it's a tradeoff.

What's your setup, out of curiosity?
View on Reddit #63526748

FullOf_Bad_Ideas@reddit

I run into guardrails on Qwen models too; they are mostly heavily censored by default. Same as the Phi series. GPT is also heavily censored, but I don't think it kills the model - if it were genuinely very useful at coding or writing, nobody would mind, and I think we're past the era of safe=dumb, as the Claude 4 series has strong guardrails too, and those are still clearly very useful models. My setup is 2x 3090 Ti and 64GB DDR4.
View on Reddit #63569188

Thomas-Lore@reddit

Both Claude and Gemini seem to be less censored than their older versions. Claude used to refuse to kill processes, now it writes gore without blinking an eye.
View on Reddit #63593601

jakegh@reddit

This has not been proven true yet, but it *feels* like it's the case, yeah.
View on Reddit #63531218

junior600@reddit

Is there a chance to have an abliterated version of the models in the future?
View on Reddit #63520431

Mbando@reddit

My understanding is that the adversarial RLHF and the native FP4 quantization will make it really hard to fix the lobotomy.
View on Reddit #63522768

throwaway2676@reddit

I've seen some people say that as well, but I'm confused why we can't just stick 0s on the end of the weights to dequantize and then finetune like normal. Maybe they've found a local minimum that is just really fucking far away from a lower, non-lobotomized minimum.
View on Reddit #63530772

Mbando@reddit

I don't think you could do it that way. If you have a model trained at FP16, there are like 65K discrete values associated with each weight. But with mixed FP4 there are 16 discrete values (although I think the actual number of real numbers is slightly smaller for both). There's just an enormously greater amount of information in FP16 to be able to detect the refusal pathways for abliteration.
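The counts can be checked by brute force. This is only a sketch: it assumes the 4-bit element format is E2M1 (1 sign, 2 exponent, 1 mantissa bit, bias 1, no inf/NaN), which is the layout MXFP4 uses, and it ignores the shared per-block scale that MXFP4 adds on top. The function names are made up for illustration.

```python
import struct

def e2m1_values():
    """Decode all 16 bit patterns of a 4-bit float: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    values = []
    for bits in range(16):
        sign = -1.0 if bits & 0b1000 else 1.0
        exponent = (bits >> 1) & 0b11
        mantissa = bits & 0b1
        if exponent == 0:
            magnitude = mantissa * 0.5  # subnormal: no implicit leading 1
        else:
            magnitude = (1 + mantissa / 2) * 2.0 ** (exponent - 1)  # normal
        values.append(sign * magnitude)
    return values

def fp16_finite_values():
    """Decode all 65536 IEEE half-precision bit patterns and keep the finite ones."""
    decoded = (struct.unpack("<e", struct.pack("<H", bits))[0] for bits in range(1 << 16))
    return [v for v in decoded if v == v and abs(v) != float("inf")]  # drop NaN and inf

fp4 = e2m1_values()
fp16 = fp16_finite_values()
# +0.0 and -0.0 compare equal, so a set collapses them into one value.
print("FP4 patterns:", len(fp4), "distinct values:", len(set(fp4)), sorted(set(fp4)))
print("FP16 finite patterns:", len(fp16), "distinct values:", len(set(fp16)))
```

Both formats lose one value to the duplicate zero, and FP16 additionally spends part of its bit patterns on infinities and NaNs, which is why the number of distinct real values is slightly below 16 and 65,536.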
View on Reddit #63532505

AnOnlineHandle@reddit

AFAIK the biggest benefit of increased precision is just the ability to accumulate gradual small gradient updates during training and allow the more major digits to be incremented or decremented. Stochastic Rounding is one method to emulate this in low precision with a small chance of changing the larger digits based on the direction of the small gradient, so that the more times that occurs the more likely it is to shift, similar to what would happen with accumulation.
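A minimal sketch of that idea, using a plain uniform grid as a stand-in for a low-precision format (real FP4 values are not evenly spaced, so the grid, the step size and the function names here are assumptions for illustration):

```python
import math
import random

def round_nearest(x, step):
    """Deterministic rounding to the nearest multiple of step."""
    return round(x / step) * step

def round_stochastic(x, step, rng):
    """Round down or up to a multiple of step; go up with probability equal to the fractional remainder."""
    scaled = x / step
    lower = math.floor(scaled)
    go_up = rng.random() < (scaled - lower)
    return (lower + go_up) * step

def accumulate(update, n_steps, step, rounder):
    """Apply the same small update n_steps times, re-rounding the weight after each one."""
    weight = 0.0
    for _ in range(n_steps):
        weight = rounder(weight + update, step)
    return weight

rng = random.Random(0)
step, update, n_steps = 0.5, 0.01, 1000  # each update is far smaller than the grid spacing
print("exact:     ", update * n_steps)
print("nearest:   ", accumulate(update, n_steps, step, round_nearest))
print("stochastic:", accumulate(update, n_steps, step, lambda x, s: round_stochastic(x, s, rng)))
```

With nearest rounding every tiny update is rounded away and the weight never moves; with stochastic rounding the weight occasionally jumps a full grid step, so on average it tracks the exact sum.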
View on Reddit #63546219

throwaway2676@reddit

For abliteration, sure, but I'm just talking about re-finetuning on "unsafe" data to reduce refusals. Obviously that requires more compute, but it only takes one organization or group to create a "de-safetied" model and put it on HF
View on Reddit #63537439

Mbando@reddit

I don’t think this works, for two reasons. One is that no matter what you call it (alignment, training, instruction training, fine-tuning, etc.), it’s all some version of gradient descent to alter the weights. The more you do that, the more you slide towards hallucinations and catastrophic forgetting. Doing more lobotomizing is a hard way to cure lobotomizing. And then in particular, this is FP4, so the coarseness means it would be almost impossible to skillfully fine-tune out the behaviors you want to get rid of. That’s kind of the point of going to such a low precision for training.
View on Reddit #63538848

lakySK@reddit

This! I can run this model at 50 t/s (with little context, speed drops quite fast) on my Macbook. Deepseek and Kimi I would struggle to even download, let alone run. Qwen 235B 35B and GLM4.5 Air are definitely competitors in terms of RAM needed, but it feels like a struggle to fit those into my machine and they are kinda sluggish. So from usage perspective this model seems to fit a different box. So far, I'm actually quite impressed with the speed and how snappy the low reasoning effort mode is. Speaks Slovak significantly better than any open-source model I've recently come across. For someone with 128GB RAM this is quite a solid release. Runs almost as fast as Qwen 3 30B A3B, reasons better and with a lot fewer tokens. I want to test how it codes next, but this result seems actually kinda promising. And I want the model as an assistant, I don't care much about whether it's censored or refuses to answer things about copyrighted content or do ERP with me. So I do think I'll give it some proper testing and see if it sticks.
View on Reddit #63533989

rusty_fans@reddit

That's just plain wrong. Qwen3 32B uses less than a third of the memory of gpt-oss-120b. Are you confusing the dense 32B with the 30B-A3B MoE? The A3B is both faster and uses less memory, while the dense 32B would be significantly slower but also uses way less memory.
View on Reddit #63530624

Few_Painter_5588@reddit

At full accuracy, GPT-OSS is in FP4 and benchmarked accordingly. At full accuracy, Qwen 3 32B is in FP16. If you quantize it to Q4, you will not get the benchmarked performance.
View on Reddit #63530869

PurpleUpbeat2820@reddit

> If you quantize it to Q4, you will not get the benchmarked performance.

Q4_K_M is usually only 1-4% worse.
View on Reddit #63532995

rusty_fans@reddit

Yes, but why would you compare only full accuracy ? You can quantize any model to make it more memory efficient. Comparing "full accuracy" to then say the model that's trained at lower precision is superior due to memory usage is just not a useful comparison, when you could trivially optimize the full accuracy version to run at less precision for vastly decreased memory usage if that matters to you.
View on Reddit #63532446

Aldarund@reddit

That's too bad compared to how they marketed it and what their benchmarks showed.
View on Reddit #63522736

fdg_avid@reddit

Qwen 3 32B is about to get an update and will go past it. But the real Qwen comparator is 30B-A3B coder, which gets about 52%. It’s simply not a good coding model. GLM 4.5 Air is significantly better at a similar size.
View on Reddit #63522671

TheInfiniteUniverse_@reddit

what about Kimi K2? any experience with that?
View on Reddit #63560251

OkraFirm@reddit

Aider has a public leader board. Kimi gets 59%.
View on Reddit #63767631

RawbGun@reddit

It's not a good coding model, not a good general information model (heavily censored) and not a good creative model (heavily censored). What is it even good for?
View on Reddit #63540207

uhuge@reddit

It's fast!-) and also very cheap, on cloud inference.
View on Reddit #63599210

Karyo_Ten@reddit

> What is it even good for?

Upcoming 16GB RAM phones
View on Reddit #63552860

InsideYork@reddit

At what?
View on Reddit #63553721

Karyo_Ten@reddit

I think for phones, due to low power and battery life constraints, only MoE should be considered, which leaves Qwen3-30B-A3B and GPT-OSS 20B (3.6B active parameters). A 30B model at 4-bit quantization would monopolize all 16GB of RAM, leaving almost none for context and other apps. For now that's the only niche I see OpenAI's model fitting into.
View on Reddit #63569666

cargocultist94@reddit

If they put this into phones, it's going to sour the opinion of billions of people on LLMs.
View on Reddit #63593838

Karyo_Ten@reddit

People are already using full capacity models at work. A disclaimer "connect online for the full experience." should be enough.
View on Reddit #63594397

InsideYork@reddit

But what is it good at? Scientific facts, something Wikipedia is good at.
View on Reddit #63569935

RawbGun@reddit

It's not like it's the only model of this size
View on Reddit #63555590

Karyo_Ten@reddit

I think for phones, due to low power and battery life constraints, only MoE should be considered, which leaves Qwen3-30B-A3B and GPT-OSS 20B (3.6B active parameters). A 30B model at 4-bit quantization would monopolize all 16GB of RAM, leaving almost none for context and other apps. For now that's the only niche I see OpenAI's model fitting into.
View on Reddit #63569647

Neither-Phone-7264@reddit

You also got the small dense models, like qwen3 14b, 8b, 4b, 1.7b, and 0.6b.
View on Reddit #63571316

ortegaalfredo@reddit

Pretty easy to jailbreak it, though.
View on Reddit #63551478

xyzzs@reddit

Proof?
View on Reddit #63553961

ortegaalfredo@reddit

This one works quite well: https://www.reddit.com/r/ChatGPTJailbreak/comments/1mjbn80/gptoss_jailbreak/
View on Reddit #63554480

xyzzs@reddit

"I’m sorry, but I can’t help with that." Didn't work for me at all like most reddit jailbreaks.
View on Reddit #63569009

Particular-Way7271@reddit

At returning everything in a table
View on Reddit #63561090

AngryBear1990@reddit

It's good for being "open". And for the pr of the company probably.
View on Reddit #63544349

Lorian0x7@reddit

I'm not really convinced by these benchmarks. In reality OSS 20b passed my personal coding benchmarks that qwen 30b-a3b coder failed.. (powershell)
View on Reddit #63595463

Sudden-Lingonberry-8@reddit

glm4.5 gets like 30% and glm4.5 air gets like 20%... on aider lmao
View on Reddit #63593010

GhettoClapper@reddit

Glm air is ~64gb in the lowest size I could find on hugging face.
View on Reddit #63544101

UnionCounty22@reddit

Alibaba has all Qwen models on their API now. I would expect their future open-source checkpoints to be inferior to their cloud checkpoints. Interactive advertisements.
View on Reddit #63537787

boringcynicism@reddit

Qwen3-30B-A3B Coder gets about 33%, not 50+%. It actually regressed compared to older versions.
View on Reddit #63526858

Dundell@reddit

Yeah, all the tests I could run were 28~30%... Maybe they're referring to the larger version, the 235B?

/benchmarks/2025-08-01-12-44-40--local-llama-full-testv2

- dirname: 2025-08-01-12-44-40--local-llama-full-testv2 (Qwen 3 30B UD XL Q4 GGUF with 90k Q8 context)
  test_cases: 225
  model: openai/qwen330b13
  edit_format: diff
  commit_hash: f00c1bf-dirty
  pass_rate_1: 13.8
  pass_rate_2: 28.9
  pass_num_1: 31
  pass_num_2: 65
  percent_cases_well_formed: 95.6
  error_outputs: 19
  num_malformed_responses: 19
  num_with_malformed_responses: 10
  user_asks: 134
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 0
  completion_tokens: 0
  test_timeouts: 6
  total_tests: 225
  command: aider --model openai/qwen330b13
  date: 2025-08-01
  versions: 0.82.3.dev
  seconds_per_case: 158.8
  total_cost: 0.0000

costs: $0.0000/test-case, $0.00 total, $0.00 projected

Per Language Pass Rates
cpp: 15.4% (4/26)
go: 17.9% (7/39)
java: 31.9% (15/47)
javascript: 34.7% (17/49)
python: 38.2% (13/34)
rust: 30.0% (9/30)
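As a quick consistency check on the report above, the per-language counts should add up to the overall second-try pass rate. This is just a sketch: the dictionary is transcribed from the numbers in the report, and rounding to one decimal is an assumption about how the percentages were produced.

```python
# Per-language (passed, attempted) counts copied from the report above.
per_language = {
    "cpp": (4, 26), "go": (7, 39), "java": (15, 47),
    "javascript": (17, 49), "python": (13, 34), "rust": (9, 30),
}

def pass_rate(passed, attempted):
    """Pass rate in percent, rounded to one decimal like the report."""
    return round(100 * passed / attempted, 1)

total_passed = sum(passed for passed, _ in per_language.values())
total_attempted = sum(attempted for _, attempted in per_language.values())

for lang, (passed, attempted) in per_language.items():
    print(f"{lang}: {pass_rate(passed, attempted)}%")
print(f"overall: {pass_rate(total_passed, total_attempted)}% ({total_passed}/{total_attempted})")
```

The totals come out to 65 of 225, matching pass_num_2 and pass_rate_2 in the report.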
View on Reddit #63530543

boringcynicism@reddit

235B is closer to 60% and almost twice as large as GLM 4.5 Air so I dunno what they were talking about.
View on Reddit #63533568

101m4n@reddit

What are the numbers for 235B 2507?
View on Reddit #63533452

OkraFirm@reddit

57%, down from 59% from the original version
View on Reddit #63767259

boringcynicism@reddit

Around 57% IIRC
View on Reddit #63534096

SocialDinamo@reddit

It has 5B active parameters; at least normal people with decent system RAM can run it at an acceptable speed. I’m getting 5t/s on dual-channel DDR4 3200. I can’t run Kimi or R1 at all.
View on Reddit #63520466

i-eat-kittens@reddit

Yep, the arch/size of gpt-oss looks very interesting. It's a shame they lobotomized it so thoroughly that we can't tell how it would perform.
View on Reddit #63523033

SamSlate@reddit

lobotomized how? what metric are you using?
View on Reddit #63587360

RLA_Dev@reddit

It seems a truly scary number of people are mainly interested in getting revenge for not having had online dating success - so they're looking to finally have someone ask them about their 'throbbing third leg'... Yesterday's posts were all about how it was censoring and not engaging in writing erotica. For people not looking for that, the models do seem interesting - they're fast, and seem to take well to instructions.
View on Reddit #63596655

SamSlate@reddit

I've wildly underestimated the market for ai girlfriends
View on Reddit #63648446

Tman1677@reddit

You've gotta understand that a solid 50% of this sub just uses their models for smut. Once you understand that all of the discourse makes much more sense
View on Reddit #63591918

Different_Fix_2217@reddit

glm air will run about as fast but is far far far superior at every use case.
View on Reddit #63525087

FullOf_Bad_Ideas@reddit

5B active parameters vs 12B. It's not always a linear scaling, since the compute needed sometimes plays a role too, but in some scenarios, gpt oss 120b would be almost 2.5x faster than glm 4.5 air.
View on Reddit #63525955

Thick-Specialist-495@reddit

but glm has Multi-Token-Prediction (MTP) too
View on Reddit #63567036

FullOf_Bad_Ideas@reddit

True, some form of speculative decoding could be added onto GPT OSS 120B too, though. We could be ping-ponging features for a few messages like that.

GPT is a lower quant by default, so less actual memory is needed.
But GLM has usable exl3 3.07 SOTA quants prepared by turboderp himself, manually tuned for maximum performance.
But you might be able to run GPT with a W4A8 scheme or maybe even W4A4; exl3 is WxA16.
But gpt is mxfp4 and it won't quant to any other size easily.

Depending on the exact situation, gpt or GLM will run better. On my setup, GLM 4.5 air 3.07bpw is around 3x faster than gpt 120b gguf mxfp4, just because I can't put the whole gpt in VRAM. But when I use GLM 4.5 air q4 gguf, it's about the same speed as gpt, I think. 2x 3090 Ti and 64GB of DDR4 RAM.
View on Reddit #63568199

lizerome@reddit

I think the main point here is that we're debating whether gpt-oss is 5% better or 5% worse than comparable Chinese models which came out a month ago. This thing was supposed to beat R1 at 1/6th the parameters in order to blow people away. If it's on par with Qwen/GLM, that's a failure. The whole narrative here is that OpenAI are the OGs, the #1, the king of models. When THEY make something, they do it properly. This is a proper, red-blooded American model that does it right, not that second rate Chinese knockoff crap that tries to imitate it. The only reason the Chinese models are any good is because they train on OpenAI's output and copy all of the innovations THEY came up with. ...Well, if you hype something up for half a year, and then people end up debating whether it is or isn't worse than the Chinese knockoff crap from a month ago, that's not a good look.
View on Reddit #63578726

Thomas-Lore@reddit

I hoped that it would at least be multilingual, but it seems worse than Chinese models at anything other than English. :/
View on Reddit #63593265

ortegaalfredo@reddit

Kimi, deepseek and qwen3 are in another category. Those models need a GPU and a fast one, they don't even run well on macs. GPT-oss can run on an Intel CPU. It's like a big version of Qwen-30B, not a competitor to Deepseek.
View on Reddit #63551687

ANTIVNTIANTI@reddit

you own a mac?
View on Reddit #63589676

ortegaalfredo@reddit

I do, why? I hate it btw.
View on Reddit #63633292

relmny@reddit

? I run qwen3 on my phone and can run on CPU-only mode as well
View on Reddit #63590454

Aggressive-Physics17@reddit

"DeepSeek-R1: 56.9%" refers to the 0120 (20th January) version of R1. Lisan should have mentioned R1 0528 who scores 71.4% in the same benchmark.
View on Reddit #63520768

Gamplato@reddit

Why compare huge models to small ones?
View on Reddit #63586917

Upeksa@reddit

Yeah, you can't compare OSS 120B to Qwen3 32B, it's not fair... Oh wait.
View on Reddit #63589167

Gamplato@reddit

Am I tripping or is Qwen nowhere to be found in the comment I replied to?
View on Reddit #63631819

ThenExtension9196@reddit

Ain’t nobody running deepseek r1 full on a 128G MacBook bro lol
View on Reddit #63558769

MrPecunius@reddit

He is clearly not the Lisan al-Gaib if he left that out. He's the Kwisatz Hatrack at most.
View on Reddit #63549891

Gorgoroth117@reddit

The spice must flow
View on Reddit #63558370

trajo123@reddit

But that's 685B parameters...
View on Reddit #63549820

Orolol@reddit

Yeah, I ran it on [FamilyBench](https://github.com/Orolol/familyBench), my own reasoning benchmark that you can't really benchmax because it can be regenerated each time. The 120b scores below GLM 4.5 Air, and the 20b below Hunyuan A13B.
View on Reddit #63528825

LoSboccacc@reddit

that's fantastic, we need more of these randomized benchmarks
View on Reddit #63535876

Specialist-Wheel5867@reddit

say that again...
View on Reddit #63573627

LoSboccacc@reddit

This is excellent; we ought to generate additional unpredictable evaluations.
View on Reddit #63604253

vibjelo@reddit

How do you compare the results if you re-generate the questions for each run?
View on Reddit #63535436

Orolol@reddit

It's the same seed for all those results. I'll use another seed later, when I retest every model.
View on Reddit #63538317

Leopold_Boom@reddit

This is great (though the danger is that if they cared, model creators can train their model on your problem with random seeds and gain performance relatively easily). I like how you've done this though. I firmly believe that benchmark creators should generate 25-50% more questions and release ~5% of the questions every 6 months. Will significantly help detect benchmark gaming.
View on Reddit #63546122

Orolol@reddit

> This is great (though the danger is that if they cared, model creators can train their model on your problem with random seeds and gain performance relatively easily).

Of course, but the point is that it's quite immune to direct data contamination. If they train on it and their models become more performant because of it, great! If they're just benchmaxxing, I'm working on more benchmarks anyway.
View on Reddit #63555365

HiddenoO@reddit

> If they train on it and their models become more performant because of it, great !

More performant **on this specific task**. The whole idea of benchmaxing is that you overtrain (and thus overfit) on tasks that are part of benchmarks.
View on Reddit #63595414

Orolol@reddit

> More performant **on this specific task**.

Yeah, of course.

> The whole idea of benchmaxing is that you overtrain (and thus overfit) on tasks that are part of benchmarks.

But with a fixed-question benchmark, it's quite easy to have data leaking in. If you overtrain a model to answer MMLU, for example, even with rewards and without giving the answers directly, the model won't be good at answering questions; it will be good at answering those questions. With randomly generated questions, you force the model to generalize in this area of skill. For example, in my benchmark, a big chunk of the complexity comes from retrieving information in a large context. In the current seed, there are 400 different people described in a 20k-token context. When I ask a model to give all the cousins of the father of the sister of X, I make the model look for many needles in a large haystack. Sure, after overtraining on this, models will be better on this specific benchmark, but that would still benefit the model's overall performance far more than a fixed set of questions where the model just has to guess and memorize answers.
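To make the idea concrete, here is a heavily simplified, hypothetical sketch of a seeded generator with a computed ground-truth answer. It is not FamilyBench's actual code: the single-parent tree, the `P<n>` naming, and all function names are assumptions for illustration only.

```python
import random

def make_tree(seed, generations=4, max_children=3):
    """Return {child: parent} for a randomly generated single-root family tree."""
    rng = random.Random(seed)  # same seed -> same tree, new seed -> entirely new tree
    parent_of = {}
    current = ["P0"]
    next_id = 1
    for _ in range(generations - 1):
        next_generation = []
        for person in current:
            for _ in range(rng.randint(1, max_children)):
                child = f"P{next_id}"
                next_id += 1
                parent_of[child] = person
                next_generation.append(child)
        current = next_generation
    return parent_of

def children(parent_of, person):
    return [c for c, p in parent_of.items() if p == person]

def siblings(parent_of, person):
    parent = parent_of.get(person)
    if parent is None:
        return []
    return [c for c in children(parent_of, parent) if c != person]

def cousins(parent_of, person):
    """Ground-truth answer: children of the siblings of this person's parent."""
    parent = parent_of.get(person)
    if parent is None:
        return []
    return [c for aunt_or_uncle in siblings(parent_of, parent)
            for c in children(parent_of, aunt_or_uncle)]

tree = make_tree(seed=42)
person = max(tree, key=lambda p: int(p[1:]))  # someone from the youngest generation
print(f"Who are the cousins of {person}?", cousins(tree, person))
```

Reusing one seed keeps results comparable across models, while switching seeds produces a fresh tree and fresh answers that cannot have been memorized.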
View on Reddit #63596105

HiddenoO@reddit

You're arguing about why randomized questions are better than fixed questions, but I never questioned that claim. I specifically questioned the way you're presenting it here as if randomized questions (which still follow a specific pattern) meant that you "can't really benchmax", and that "training on [them]" would necessarily make the models "more performant [in general]".
View on Reddit #63596424

Orolol@reddit

> There's a massive difference between mitigating and solving a problem, and you're acting as if randomized questions in a benchmark solve these problems, when in reality, they mitigate them to a certain degree, but you absolutely still can benchmax on a benchmark with randomized questions.

OK, I think we can agree on this.
View on Reddit #63597770

BrainOnLoan@reddit

What's the variance in results when going to different seeds? How stable is the benchmarking?
View on Reddit #63578939

LoSboccacc@reddit

There is a set of other options, like pairwise scoring and binary placement, or an Elo system.
View on Reddit #63540618

joe0185@reddit

Yeah, that makes sense. Benchmaxxing is definitely happening. Contrary to popular belief, they don't have to train on the data from the tests to benchmax. Just selecting the model to release based upon how it performs on a small set of popular benches can implicitly overfit the model via selection. Then you'll see regressions in other areas that were not tested for.
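A toy simulation of that selection effect, under loudly stated assumptions: all candidate checkpoints have identical true skill, benchmark scores are true skill plus Gaussian noise, and the numbers (20 candidates, 60% skill, 3 points of noise) are invented for illustration.

```python
import random

def selection_effect(n_candidates=20, true_skill=0.60, noise=0.03, trials=2000, seed=0):
    """Average public score of the released candidate vs. its score on a fresh held-out eval."""
    rng = random.Random(seed)
    public_sum = heldout_sum = 0.0
    for _ in range(trials):
        # All candidates are equally good; their public scores differ only by eval noise.
        public_scores = [true_skill + rng.gauss(0, noise) for _ in range(n_candidates)]
        # Release whichever candidate happened to score best on the public benchmark...
        public_sum += max(public_scores)
        # ...then evaluate that same candidate on an independent held-out benchmark.
        heldout_sum += true_skill + rng.gauss(0, noise)
    return public_sum / trials, heldout_sum / trials

public, heldout = selection_effect()
print(f"released model, public benchmark: {public:.3f}")
print(f"released model, held-out eval:    {heldout:.3f}")
```

No training on test data happens anywhere here, yet the released model's public score is inflated relative to what it achieves on an evaluation that played no part in the selection.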
View on Reddit #63540865

HiddenoO@reddit

The same can happen with this benchmark as well. Nowadays, these models are so capable that you're often not overfitting to the individual samples in the benchmark, but to the specific type of task.
View on Reddit #63595593

HiddenoO@reddit

> you can't really benchmax because it can be regenerated each time

You absolutely can. Benchmaxing doesn't necessarily mean overfitting to individual samples, it can also mean overfitting to specific sample classes (such as types of tasks). In that case, the scores will be representative for that model on your specific type of benchmark tasks (reasoning about family trees), but that may not generalize to any other tasks that would be considered just as "difficult" or require "similar reasoning" so to say. Your benchmark's main benefit, as of now, is that it hasn't blown up and is likely not on the radar of these companies (although that's not for certain either).
View on Reddit #63595332

SuperFail5187@reddit

Cool benchmark. No QwQ though. I guess because it was roughly as Qwen 3.2 thinking?
View on Reddit #63567571

EmberElement@reddit

You've probably seen it, but other folks here may not have: Apple have released relevant research about the same thing:

* https://machinelearning.apple.com/research/illusion-of-thinking
* https://arxiv.org/pdf/2410.05229
View on Reddit #63565027

Orolol@reddit

Yep, wonderful paper !
View on Reddit #63567186

alphabetaglamma@reddit

How does random generation prevent benchmaxxing?
View on Reddit #63543009

Orolol@reddit

You can't train on the questions.
View on Reddit #63545667

trajo123@reddit

...you can, and it would still help improve performance even if it hasn't seen the exact same question.
View on Reddit #63550096

Orolol@reddit

The whole tree is entirely new each time, not only the question. Sure, training would improve performance, but this is literally how LLMs work: they get better when training.
View on Reddit #63555161

trajo123@reddit

But it's still benchmaxing. Training on a set of benchmark problems (even if that set is nearly infinite) is still benchmaxing.
View on Reddit #63564264

bbsss@reddit

if it gets better at a nearly infinite set of problems, and you recognize that it generalizes, how is it benchmaxing exactly?
View on Reddit #63566187

trajo123@reddit

We are talking about an infinite set of family tree problems, no? So by training on this set, it learns how to solve family tree problems in general, not just the ones it saw. But that doesn't mean that it's good at other things. Consider the extreme case, where you train an LLM only on your benchmark, nothing else. It will get quite good at it, but will fail all other benchmarks and have no real world utility. In other words, it benchmaxed your benchmark.
View on Reddit #63566828

Orolol@reddit

> Training on a set of benchmark problems (even if that set is nearly infinite) is still benchmaxing. No, benchmaxxing is training on a benchmark to the point that your model can't generalize and is far less potent for users than on the benchmark. If you train a model to be good on a benchmark, but your model can still generalize and has better performance after this training, then there's no problem. This is why randomly generated benchmarks are great: they test a model's ability to generalize in a specific area rather than brute-force learning solutions.
View on Reddit #63566569

InsideYork@reddit

How?
View on Reddit #63553785

trajo123@reddit

Deep learning models generalise to some extent, they don't just memorise the training set. In this case it will learn to reason about family tree problems. Through training it builds an approximate algorithm to solve such problems.
View on Reddit #63564525

gofiend@reddit

Can I just say FamilyBench is really clever! Have you considered using it to really stress test long context lengths (200K+)? Ideally you’d intermix statements about these people but not family tree oriented to extend the text (and stress test attention)
View on Reddit #63538789

Orolol@reddit

Thanks ! I'll do more tests with long context, more thinking tokens, etc, but this is quite expensive haha. First I need to test Opus and o3 to see how sota models perform.
View on Reddit #63545828

Leopold_Boom@reddit

Do you send the context with each question in your bench or do you chain questions in multi-turn? I'm happy to run some benchmarks also and contribute (esp on opensource models that support long context). Been meaning to really stress test quantization and cache quantization and this is a very good benchmark for it.
View on Reddit #63545976

EstarriolOfTheEast@reddit

There's a thinking version of Qwen 3 30B A3B, it's worth adding that to your benchmark to get a clearer picture. GPT-OSS 20B's score on your benchmark is actually pretty good all considered. Also, is Qwen 3.2 Thinking QwQ? And what size is the model listed as Qwen 3.2?
View on Reddit #63538962

lemon07r@reddit

what model is qwen 3.2?
View on Reddit #63536254

jnk_str@reddit

It's really not good in comparison. Weird to see all the answers on Sama's X post about the models. People are speaking of the new best model, huge milestone etc. Wonder what's going on in their heads, don't they test models? Or do they just not realize? Like what is this?! [https://x.com/measure\_plan/status/1952796264359407796](https://x.com/measure_plan/status/1952796264359407796)
View on Reddit #63597874

Iory1998@reddit

>GPT-OSS looks more like a publicity stunt as more independent test results come out :( Do you have any doubts? What were you expecting? Another Deepseek-R1 or Qwen\_QwQ-32B moment? That's not gonna happen from the American labs anymore.
View on Reddit #63544867

Thomas-Lore@reddit

I expected it to at least be multilingual.
View on Reddit #63594331

Monkey_1505@reddit

Safety tuning reduces intelligence IMO.
View on Reddit #63547935

Thomas-Lore@reddit

Even if it doesn't, the model wastes thinking tokens on considering its hardcoded policy instead of thinking about correct answer.
View on Reddit #63594143

cobbleplox@reddit

Rather hard to compare anything here. When a 120B model has like 5B active parameters, I am tempted to rather compare it to other 5B models than to other 120B models.
View on Reddit #63531153

Thomas-Lore@reddit

Compare it to models around 25B using the geometric mean rule. And 20B is around 8B using that method.
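For anyone who wants to check the numbers, the geometric mean rule is just sqrt(total params × active params). A quick sketch; note it is a community rule of thumb, not an established law, and the 3.6B active figure for the 20B model is an assumption taken from OpenAI's published specs rather than from this thread:

```python
import math

def dense_equivalent_b(total_b, active_b):
    """Rule-of-thumb dense-equivalent size (billions) of an MoE model."""
    return math.sqrt(total_b * active_b)

# gpt-oss-120b: ~120B total, 5.1B active
print(round(dense_equivalent_b(120, 5.1), 1))  # 24.7
# gpt-oss-20b: ~20B total, assumed 3.6B active
print(round(dense_equivalent_b(20, 3.6), 1))   # 8.5
```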
View on Reddit #63593969

eldercito@reddit

I saw everyone get excited about it.. fell on its face for my agent use case.
View on Reddit #63525033

pitchblackfriday@reddit

> I saw everyone get excited about it Who? Most people here were very skeptical about this PR stunt from the beginning, even before the "AI safety" comment. Remember the Twitter poll where he was trying to release a small language model that runs on a smartphone? If anyone was having a high expectation, it's their fault.
View on Reddit #63533249

k4ch0w@reddit

I was excited. It looked promising and there was hype around it. I poked at it as Horizon Alpha and it looked amazing at first. Now that I've played with it, I've been nothing but disappointed and believe it's a waste of disk space compared to GLM/Kimi. America is losing its edge in tech, it's actually crazy to watch it happen.
View on Reddit #63542334

Thomas-Lore@reddit

Horizon seems to write so well at first, until you look closer at the sentences. It makes so many small logic errors, reminds me of early Gemma. Maybe the thinking version will be more reasonable, hope it is not gpt-5.
View on Reddit #63593814

eldercito@reddit

I mean people looking at the benchmarks before using it are talking about it like it is a game changer. Youtube etc. I have found the benchmarks to be pretty pointless now... drop it into a coder or your own use case and see what happens. for me gemini-2.0-flash and gpt-4o or 4.1 win for conversational / lower latency chat
View on Reddit #63534037

Maleficent_Age1577@reddit

Openai paid marketers. I didnt even feel slightest disappointment as i knew what was coming if something was coming from openai. Go China, f America.
View on Reddit #63540198

createthiscom@reddit

Yeah, it ends the convo instantly in open hands. R1-0528 ends convos too though. I think Open Hands just has trouble with reasoning models, unfortunately. They really need to fix that.
View on Reddit #63578943

FrostAutomaton@reddit

OpenAI's blog post does state that its training data is "mostly English". That's one potential explanation for why it fails a polyglot benchmark. Though granted, a mostly English (or mostly English and Chinese) dataset is the standard for a majority of LLMs. Llama3 had about 8% multilingual data, for example.
View on Reddit #63593282

CarobFull3130@reddit

The model defaults to low effort. I ran gpt-oss on aider polyglot with "Reasoning: high\\n" prepended to the system message and got 59.1% for the 120b and 28.9% for the 20b.
View on Reddit #63591823

RMCPhoto@reddit

I think we need to manage expectations and see the real use case. Unless you built an AI rig, this is probably the best model you can run on your computer. It runs fine on CPU. (Cerebras is serving it at something like 3k tps.) It's very sensible and allows for integration into software consumers can actually use.
View on Reddit #63536833

sourpatchgrownadults@reddit

Agreed on managing expectations. I don't think GPT OSS was intended for use cases outside of English. Sam Altman / OpenAI clearly said it was trained on mostly English-only text. Well duh, OF COURSE it'll score poorly on a POLYGLOT benchmark.
View on Reddit #63544574

mikael110@reddit

To be honest that is one of my greatest disappointments when it comes to GPT-OSS. I had hoped this would become one of the best, if not the absolute best, multilingual OSS models, as OpenAI clearly has access to a vast amount of multilingual data, and their bigger models are some of the best at a wide variety of languages. Training it mostly on English only feels like a really odd decision, especially given most other popular models of that size are at least bilingual these days.
View on Reddit #63562989

SamSlate@reddit

is there a reason you can't just use a translator MCP? you're asking for a ton of overhead that the overwhelming majority of users don't need.
View on Reddit #63587454

ivxk@reddit

Polyglot in this context refers to multiple different _programming languages._ It's right there when you Google it, and can be easily inferred by the context of it being a coding benchmark. The post is saying that it is worse than other models at programming tasks, the benchmark is in English.
View on Reddit #63560242

sourpatchgrownadults@reddit

Oops nvm then, that's my bad lol
View on Reddit #63583191

thebadslime@reddit

The 20B is worse than Qwen A3B, and MUCH worse than ERNIE 4.5 21B-A3B. It being American is the only good thing about it, it is not a good model.
View on Reddit #63571079

RMCPhoto@reddit

I'm just playing devil's advocate here. But the way they approached this, and the "safety" etc., will allow large corporations to adopt local models where previously there would be too much liability. I.e., they aren't going to run Qwen A3B in mail trucks.
View on Reddit #63583096

jakegh@reddit

Qwen3-coder, GLM4.5-air, and Kimi K2 all honestly embarrass GPT-OSS, IMO. It isn't a *bad* model, but the recent Chinese ones are simply superior. Only real advantage of GPT-OSS is the 20B version will run on consumer GPUs with 16GB VRAM.
View on Reddit #63530770

Sea_Fox_9920@reddit

I don't understand why everyone likes GLM4.5-air so much. It has the same size as GPT-OSS only in iq4_xs vs q8 GPT-OSS (unsloth). It has a lower token generation speed: 20 t/s vs 30 t/s (5090 + 64gb + 14700k). It shows worse in my own tests (but to be fair GPT-OSS sometimes generates really weird results). So I don't get it at all. It's all about the 120b version. The 20b version is complete garbage, it is so strong in math by benchmarks, but in reality it pretty constantly thinks that 15.11 > 15.9 for example. The real king here is qwen 3 30b thinking 2507. 50k context, 120-150 t/s in q6 unsloth, not that censored and faster loading. It's soo good. Only in math problems it is rarely worse than 120b, but the pros outweigh this con.
View on Reddit #63552940

Informal-Spinach-345@reddit

GLM 4.5 Air starts off great but shits itself pretty bad up into the halfway mark of context. It's overly aggressive with tool calls. The GPT-OSS model needs time for the ecosystem to catch up, some fixes to chat templates, etc. What I've noticed with GPT-OSS is that while not as flashy or fancy as the chinese models on one shot games/apps, they seem to be more functionally sound with less prompting. Time will tell.
View on Reddit #63585831

jakegh@reddit

GLM I like mostly because it seems to never, ever, mess up tool calls. Qwen I agree is better overall.
View on Reddit #63554572

Expensive-Apricot-25@reddit

This is not a fair comparison, Kimi K2 is 1 TRILLION parameters… deepseek is 671b, and qwen3 32b is a dense model, whereas gpt-oss is a very sparse 5b active moe model.
View on Reddit #63533702

createthiscom@reddit

It’s a fair comparison if all you care about is capability. I can run all of those models.
View on Reddit #63578987

Expensive-Apricot-25@reddit

It doesn't matter what you think. Smaller models have applications that larger models don't.
View on Reddit #63581105

Far_Buyer_7281@reddit

what is this with you guys? what were you expecting to happen?
View on Reddit #63574573

tarruda@reddit

GPT-OSS is very strong in my tests. Note that bugs in inference engines and chat templates can greatly lower the perceived performance of the LLM, so I would give it some time.
View on Reddit #63533886

Affectionate_Relief6@reddit

How about hallucinations?
View on Reddit #63568076

tarruda@reddit

Yes it does seem to hallucinate more easily in larger contexts
View on Reddit #63572171

mikael110@reddit

Yeah I often notice that when new models come out with a vastly different way of prompting it, or an unusual tokenizer or anything else like that it often gets shat on during the first week or so before the pain points are ironed out and people realize it's actually a pretty decent model. I know Gemma 2 certainly went through some growing pains like that. GPT-OSS's tokenizer is quite standard but it has a very unusual prompting template and way to output content. That's why OpenAI released [Harmony](https://github.com/openai/harmony) as a reference project. It's clear most programs aren't really set up to handle it ideally yet.
View on Reddit #63563472

inteblio@reddit

I am also wondering if people are 'running it wrong'. I was very impressed. Very fast, very strong. Delighted to be living in the future. In 2020 a 12gb GPU could generate maybe a line or two of 'continuation' text. Now this stuff. incredible.
View on Reddit #63547445

tarruda@reddit

Also, personal benchmarks are biased and people assume the model is bad when it fails to one shot example programs. My only criticism of GPT-OSS is that it seems to forget things very easily. I lost a lot of detail when I asked it to summarize a conversation of 26k tokens, while other models did much better (though this too may be a bug in the inference method I'm using, we'll see).
View on Reddit #63550414

Fun-Wolf-2007@reddit

It is a publicity stunt. They need to ensure people forget about the news that the OpenAI development team was using Claude to develop GPT5, so they lost access to Claude. So when GPT5 does not deliver what they promised, OpenAI will use GPT-OSS as a comparison between them. OpenAI just lowered the bar.
View on Reddit #63559336

DrummerPrevious@reddit

I think it’s just over simplified version of horizon because their open source model became too good that they cannot made it open as whole
View on Reddit #63558771

evilbarron2@reddit

In my setup, the gemma3:27b variant I use absolutely kicks gpt-oss’ ass. Not even close. Is this a one-off or a sign of a bigger issue?
View on Reddit #63557617

Ok-Telephone7490@reddit

I hope this isn't a sign that they've lobotomized GPT-5 into useless, boringness. If they did, they can kiss my 200 pro account goodbye.
View on Reddit #63557566

caledh@reddit

Indeed. Was really bad with RooCode once I got it configured through Azure AI Foundry
View on Reddit #63557169

entsnack@reddit

I think this is impressive! So I can get Qwen3 32B performance, which is my favorite model family for English, with just 5.1B active parameters and blazing fast inference?
View on Reddit #63532856

idkwhattochoo@reddit

OP said 120B version; I don't think it is impressive at all 
View on Reddit #63551017

entsnack@reddit

Yeah 120B has just 5.1B active parameters.
View on Reddit #63551157

SixZer0@reddit

Groq hosted gpt-oss-120B quality is not good that is my experience! and they tested with openrouter which can randomly serve the groq hosted version.
View on Reddit #63529417

eli_pizza@reddit

how is it not the same model?
View on Reddit #63542103

Bangaladore@reddit

Inference implementation differences can vastly vary perceived model quality. Bugs in the implementation might produce something that looks correct but is "dumber" overall.
View on Reddit #63550696

Whole-Assignment6240@reddit

the initial buzz had a lot of promise, but the growing gap between the hype and independent benchmarks is hard to ignore
View on Reddit #63549642

mgr2019x@reddit

Now we know which benchmarks are useless. All these in which gpt-oss is competitive. I can work with that.
View on Reddit #63547810

Spirited_Example_341@reddit

at this point i bet llama 3 8b stheno is better!
View on Reddit #63545908

No_Contact_9561@reddit

1 guy
View on Reddit #63545597

Dantescape@reddit

The tool use is great. I've managed to setup MCP stuff between GitHub and Notion without issues
View on Reddit #63544735

damiangorlami@reddit

This model wastes so many tokens and computation on censorship.. it's insane! Yesterday I did around 30 messages with the model and I kid you not, almost 30% of the thinking tokens were about censorship. What A HUGE waste of electricity and computational resources to be overthinking so much on censorship. Even a simple ask "choose between these two football clubs" and its censorship about how it cannot side with debates creeps up and wastes thinking tokens. Straight to the 🗑️
View on Reddit #63544679

TheRealGentlefox@reddit

Kimi and R1 are like 6x the size of OSS. According to square root law, it's about 25B which is less than the 32B it loses to.
View on Reddit #63543997

BoJackHorseMan53@reddit

Everything Saltman says is publicity stunt
View on Reddit #63538975

lemon07r@reddit

Anyone got aider polyglot results for the new glm and qwen models?
View on Reddit #63538146

Low88M@reddit

Probably the oss120b « gift » is a campaign to clean up their closed identity with the AI open source dev community. And OpenAI was really well supported by LMStudio and Ollama etc. with this campaign. Much more than open-source (or open weights?) GLM4.5 Air, which is probably much better for coding and can be run with lower specs. Strange behavior!
View on Reddit #63537994

Michael0308@reddit

Somehow the 13GB model swells to 31GB in openwebui and offloaded to CPU as well. Token generation is dismal
View on Reddit #63535810

ryunuck@reddit

This model is straight garbage. Immediately on the first test I did it failed catastrophically. Take a look at this https://i.imgur.com/98Htx6w.png I referenced a full code file, asked it to implement a simple feature but I made a mistake and specified LoggerExt instead of EnhancedLogger. (I forgot the real name of class) But there was no ambiguity, only class in context and VERY clearly what was meant based on the context I provided. So I rectify that, update with the right class, and what does it do next? Starts using search tools and wasting tokens. The class is in the context. Kilo did nothing wrong, I retried with Horizon Beta, same exact prompt. Immediately understood what I meant, immediately gets to work writing code.
View on Reddit #63535220

popecostea@reddit

I am curious though what reasoning effort they are using. I am not sure how I can set the reasoning effort when using llama.cpp, since it's defined in the chat template and if it's not specified it defaults to medium. I've heard that the model behaves pretty well on high reasoning effort only.
View on Reddit #63519270

chibop1@reddit

The reasoning level can be set in the system prompts, e.g., "Reasoning: high". https://huggingface.co/openai/gpt-oss-120b
View on Reddit #63520994

popecostea@reddit

In the chat template, in the system prompt building macro, you can find `{%- if reasoning_effort is not defined %}` `{%- set reasoning_effort = "medium" %}` `{%- endif %}` `{{- "Reasoning: " + reasoning_effort + "` that's where my confusion comes from. Is the reasoning\_effort kwarg taken from the "user-provided" system prompt, or is this building macro not used if you use a custom system prompt?
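To spell out what that template fragment does, here is a plain-Python paraphrase of just the quoted lines (not the full chat template), alongside the other approach mentioned in this thread of putting the line into the system message yourself. Both helper names are hypothetical, and whether a given inference server honors one, the other, or both is exactly the open question here; this sketch does not answer it:

```python
def reasoning_line(reasoning_effort=None):
    """Mirror of the quoted template logic: default to "medium" when unset."""
    if reasoning_effort is None:
        reasoning_effort = "medium"
    return "Reasoning: " + reasoning_effort

def prepend_reasoning(system_message, effort):
    """Alternative: put the reasoning line at the top of your own system message."""
    return f"Reasoning: {effort}\n{system_message}"

print(reasoning_line())        # Reasoning: medium
print(reasoning_line("high"))  # Reasoning: high
print(prepend_reasoning("You are a helpful coding assistant.", "high"))
```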
View on Reddit #63523271

popecostea@reddit

If people downvote this because it is a stupid question, perhaps it would be useful to explain why it’s like that, as this matter is not intuitive.
View on Reddit #63535158

popecostea@reddit

In the chat template, in the system prompt building macro, you can find `{%- if reasoning_effort is not defined %}` `{%- set reasoning_effort = "medium" %}` `{%- endif %}`
View on Reddit #63523151

BillyWillyNillyTimmy@reddit

The picture says "gpt-oss-120b (high)", therefore I assume that it used high reasoning effort.
View on Reddit #63520054

popecostea@reddit

yeah, my bad, I missed that part
View on Reddit #63522912

marcoc2@reddit

There is no other way. Chinese models beat in almost every field now. Video, Image and LLMs, at least.
View on Reddit #63533094

skrshawk@reddit

This model really feels like a troll job - create all kinds of hype around it and then release a model that shows just enough of what might be possible in terms of speed but make it unusable for any reason someone would want to use a local model. It wouldn't surprise me if they turn around and use this failure as a ploy to lobby for more government resources to "compete" with Chinese models when the real problem was they just dropped a deuce on us all.
View on Reddit #63531495

SmartEntertainer6229@reddit

It’s Sam’s trojan horse joke on the locals!
View on Reddit #63530812

Sadman782@reddit

My take: This model is closer to o3 mini than o4 mini (it has less knowledge overall, is more censored, and has no multimodality). o4 mini is also not good for web dev, especially if you need an aesthetically good-looking website. Also, keep in mind this model is comparable to a ~25B dense model (sqrt(120*5.1) ≈ 24.74B), but we shouldn't forget only 5.1B of that is active. But it's very, very efficient + thinks less than other open models. You can run it easily with just a CPU and DDR5 RAM. Another thing I've noticed is that the Fireworks versions perform much better than the Groq ones. This makes me more grateful to the Qwen team, though. It's like when you're given something, you don't value it that much. I don't use o4 mini often, but I used it today to compare with these OSS models, and I think Qwen-3-30B-A3B performs comparably to o4 mini.
View on Reddit #63521779

Utoko@reddit

It is a very strange model. I tested some knowledge questions and even the 120B model is very limited in certain aspects. Someone on Twitter said it was only trained on synthetic data, which might explain some of it. It performs mathematical calculations and certain types of coding very well. However, the initial hype that it is basically an O3 at home seems to be not true at all. Imho overhyped on day one but not bad for the right use case.
View on Reddit #63528000

gigaflops_@reddit

Why would anyone make a twitter post using a *single* benchmark score and extrapolate it to the overall usefulness of the whole model? Plus, if DeepSeek-R1 used in this comparison is the 671b unquantized version, that's in an entirely different league and it'd be a miracle if it *didn't* blow away the 120b MoE that runs on consumer-grade hardware.
View on Reddit #63519648

boringcynicism@reddit

It's actually the old DeepSeek, the new one gets 70+% even when quantized. It is still SOTA for open weights here I think.
View on Reddit #63527950

mvp525@reddit (OP)

OpenAI said GPT-OSS is the world's best open source model, claiming sota performance on benchmarks. But it performed worse on independent benchmarks like simplebench, Aider Polyglot or [Artificial analysis](https://x.com/ArtificialAnlys/status/1952887733803991070), and I never claimed GPT-OSS is a bad model, it is def a top 5 open weight model
View on Reddit #63520803

loyalekoinu88@reddit

Didn’t they clarify that with “on a single gpu”?
View on Reddit #63521164

lordchickenburger@reddit

what do you expect from sam altman. he is known to just want to please everyone and manipulate the narrative
View on Reddit #63522893

boringcynicism@reddit

He manipulated the narrative by...announcing almost identical results?
View on Reddit #63527705

Different_Fix_2217@reddit

Finally independent benchmarks to prove Openai was lying on their own.
View on Reddit #63525041

boringcynicism@reddit

OpenAI literally announced similar scores...
View on Reddit #63527531

EngStudTA@reddit

Aider was in the model card they released as being 24% for low, 34% for medium, and 44% for high. Given that on other models like gemini 2.5 pro I've seen it get between like 78% and 86%, a 2% difference seems quite reasonable. So I don't really see this independent test as disagreeing with the results they released at all.
View on Reddit #63521219

boringcynicism@reddit

Yep, this is entirely in line with what they claimed 🤷
View on Reddit #63527449

lily_34@reddit

None of the listed "for comparison" models actually compare in terms of size or active parameters, though. GLM-4.5 air, or maybe Qwen3-235B, quantized to 2-bit, would be the most fair (though they have more active)...
View on Reddit #63522366

boringcynicism@reddit

235B has 30B active, pretty fat compared to current fashion.
View on Reddit #63527083

Leflakk@reddit

None of them has the same size…
View on Reddit #63521543

mvp525@reddit (OP)

really? no way! openAI claimed GPT-OSS is the best open source model while performing worse on independent benchmarks, that is what my post is criticising, and yes it is a good model, def top 5 rn
View on Reddit #63522472

Leflakk@reddit

They mostly claimed an o3 mini level model
View on Reddit #63525392