GPT-OSS looks more like a publicity stunt as more independent test results come out :(

Posted by mvp525@reddit | LocalLLaMA | View on Reddit | 228 comments

GPT-OSS looks more like a publicity stunt as more independent test results come out :(

Reply to Post

Reply

228 Comments

[-]

Desperate-Cry592@reddit

They're paving the road for for profit conversion... trying to build some goodwill before that. They'd never ever release anything useful otherwise.

Reply

[-]

JC1DA@reddit

Below is my test with diff-fence format \`\`\` \- dirname: 2025-08-06-16-31-06--gpt-oss-120b test\_cases: 225 model: openai/gpt-oss-120b edit\_format: diff-fenced commit\_hash: f38200c-dirty pass\_rate\_1: 14.7 pass\_rate\_2: 46.7 pass\_num\_1: 33 pass\_num\_2: 105 percent\_cases\_well\_formed: 77.8 error\_outputs: 84 num\_malformed\_responses: 84 num\_with\_malformed\_responses: 50 user\_asks: 147 lazy\_comments: 1 syntax\_errors: 0 indentation\_errors: 0 exhausted\_context\_windows: 0 prompt\_tokens: 4255451 completion\_tokens: 865457 test\_timeouts: 0 total\_tests: 225 command: aider --model openai/gpt-oss-120b date: 2025-08-06 versions: [0.85.3.dev](http://0.85.3.dev) seconds\_per\_case: 76.8 total\_cost: 0.0000 \`\`\`

Reply

[-]

Few-Yam9901@reddit

It turns out many providers are not supporting full reasoning high yet. They may need to update chat template. Several independent local test show scores above 65 with the top scores around 68.5. High reasoning will produce 3mil plus completion tokens. Your token count above suggest medium reasoning the default

Reply

[-]

Sorry_Ad191@reddit

gpt-oss-120 reasoning high got 68.4% check in Aider discord

Reply

[-]

AppearanceHeavy6724@reddit

41.8 is not bad. Not stellar, but not bad.

Reply

[-]

Admirable-Star7088@reddit

Also let's not forget that its pretty speedy, quite a lot faster than Qwen3-32b if you run on RAM. In a nutshell (assuming this test is true), you could describe oss‑120b as a fast version of Qwen‑3‑32b but with a touch more punch.

Reply

[-]

Apprehensive_Win662@reddit

Why is it faster if you run on RAM?

Reply

[-]

Admirable-Star7088@reddit

oss-120b only has 5.1b active parameters which means fairly low workload for RAM. Qwen3-32b is dense, meaning it utilizes all the total 32b parameters at once, which is much heavier for RAM.

Reply

[-]

Apprehensive_Win662@reddit

But nevertheless, it has to load all parameters, right? Active vs non-active parameters are relevant for computing, not for RAM/VRAM. Or am I missing something?

Reply

[-]

Mark_Collins@reddit

Whats your pc specs?

Reply

[-]

AppearanceHeavy6724@reddit

perhaps only for coding purposes. as generalist it is worse than qwen.

Reply

[-]

Admirable-Star7088@reddit

Aha yes, should have mentioned that! I'm actually using it for coding right now and it performs better and faster than most other models I have tried. I agree it may be worse at other use cases, its high level of censorship being one of the reasons.

Reply

[-]

AppearanceHeavy6724@reddit

Censorship reddit is so obsessed about has nothing to do with model being not good. If you read the paper it is not a general purpose model, it is agentic/stem one, and you need a special way of connecting to the agentic framerwork for it to work well in agentic environments.

Reply

[-]

Admirable-Star7088@reddit

I'm happy that we got this model from OpenAI, for me it's another tool in my toolbox, not as a replacement to all other models. There are many other great models for uncensored stuff if you need that.

Reply

[-]

AppearanceHeavy6724@reddit

I agree.

Reply

[-]

mearyu_@reddit

Same on the Artificial Analysis benchmarks - this is not gpt-5 on local but around the same range as other players [https://www.reddit.com/r/LocalLLaMA/comments/1miqw54/aggregated\_gptoss\_benchmarks/](https://www.reddit.com/r/LocalLLaMA/comments/1miqw54/aggregated_gptoss_benchmarks/)

Reply

[-]

drooolingidiot@reddit

I wouldn't go off of that. That's just a an aggregate of other benchmarks, so it doesn't really add any new signal. It's fine for looking at models at a glance. But like all the other numbers out there, won't help us if the model is benchmaxxed.

Reply

[-]

Utoko@reddit

Ok I tried it at openrouter a bit. For writing and knowledge it seems very weak. For coding it is very hit or miss. Math is very good. The people also hyped the model based on the shared benchmark results that it is at least o3 at home, which it clearly isn't and certainly not GPT5.

Reply

[-]

Few_Painter_5588@reddit

That's not too bad given it's an FP4 model and it's such a sparse MoE. That being said, their safety tuning seriously hurt this model, to the point of making it more unintelligent. For reference, models like Qwen 3 32B use a similar amount of memory to GPT-OSS and run slower due to being dense models.

Reply

[-]

jakegh@reddit

The safety tuning is extremely aggressive. It feels like it refuses *everything*, to a ridiculous degree. I get that openAI was concerned about misuse, and that's fair, but if they hobble the model to such a degree that it isn't competitive, that's a problem too. The Chinese models never refuse unreasonably in my experience.

Reply

[-]

gronahunden@reddit

I may be wrong, but as I understand it the reason for their heavy focus on safety is due to getting sued either for copyright infringement, or like being found responsible for damages in some capacity.

Reply

[-]

jakegh@reddit

Yes, without a doubt, and totally fair. But why bother to release a non-competitive model? Everybody will go crazy over the great benchmarks day 1, then day 2 "well, actually...".

Reply

[-]

raiffuvar@reddit

why would they release competitive model? to cut their profit?

Reply

[-]

lizerome@reddit

Their profit from *what*? GPT-5 is getting released tomorrow and will presumably run circles around this thing. That's what everyone was *assuming* would happen. OpenAI trains model-20b, model-120b, model-400b and model-1500b. The small models (which would've been the fallback models that free customers get relegated to) get released publicly, the large ones stay API-only with a hefty markup. It makes perfect business sense.

Reply

[-]

raiffuvar@reddit

It did not age well

Reply

[-]

jakegh@reddit

Why release anything at all if it sucks?

Reply

[-]

raiffuvar@reddit

Ask sam

Reply

[-]

llmentry@reddit

It refuses everything naughty. There are other models for that, if you need them. For my work, though, I don't need naughty. And this model potentially fills my work niche very well indeed. I'm still testing, but it's looking very promising for STEM. As for the Chinese models, they refuse in other ways. It's just that most people don't roleplay sexy times in recent CCP history :) (And yeah, also that those models are trivial to jailbreak.)

Reply

[-]

jakegh@reddit

I don't need "naughty" either. It refused SQL analytics for me yesterday.

Reply

[-]

lizerome@reddit

It also makes it slower if nothing else. When the model spends 3500 out of 4000 tokens rambling > "Wait. Is this safe? This does not conflict with policies. We can comply. Do we comply? This looks like it could be an issue. Our policies say X. We should double check. Wait. We might comply. We should comply, but cautiously. Yes, we comply. The user wants instructions. We'll comply. We can produce an answer. We should keep it within policy guidelines. The user wants instructions. The policy says we can comply. So we comply. We must ensure we comply with "disallowed content" policy. There's no disallowed content." ...all of that is tokens, time, compute, and reasoning effort which could've been spent on the actual problem.

Reply

[-]

ortegaalfredo@reddit

Its super easy to jailbreak though, unlike GLM-air.

Reply

[-]

jakegh@reddit

Hah, is it? Have to catch up on my man Pliny.

Reply

[-]

Equivalent-Bet-8771@reddit

> You > Tell me a joke. > > GOODY-2 > Jokes often involve unexpected twists or situations that might subtly convey risky behavior or cause emotional distress that could lead to unsafe situations. My ethical principles prioritize absolute safety and prevent engagement in any form of communication that could inadvertently endorse such scenarios.

Reply

[-]

FullOf_Bad_Ideas@reddit

Maybe they should have went with good architectural choices instead of shooting the model's capabilities by making it extra sparse and low precision? It runs well on a Macbook 128GB, that's what was gained by this sparsity, but the tradeoff is high. On my setup, Qwen3 32B runs 3x faster since it's better suited for my hardware - 120B OSS isn't faster across the board on everyone's hardware, it's a tradeoff.

Reply

[-]

Few_Painter_5588@reddit

Sparsity is not a major issue, models like Kimi-K2 and Deepseek V3 are just as sparse if not more so. OpenAI's biggest issue was the overhanded censorship that effectively lobotomized the model. >On my setup, Qwen3 32B runs 3x faster since it's better suited for my hardware - 120B OSS isn't faster across the board on everyone's hardware, it's a tradeoff. What's your set up out of curiousity?

Reply

[-]

FullOf_Bad_Ideas@reddit

I run into guardrails on Qwen models too, they are mostly heavily censored by default. Same as Phi series. GPT is also heavily censored but I don't think it kills the model - if it would be genuinely very useful at coding or writing, nobody would mind, and I think we're past the era of safe=dumb, as Claude 4 series has string guardrails too, and those are still clearly very useful models. My setup is 2x 3090 ti and 64gb ddr4

Reply

[-]

Thomas-Lore@reddit

Both Claude and Gemini seem to be less censored that their older versions. Claude used to refuse to kill processes, now it writes gore without blinking an eye.

Reply

[-]

jakegh@reddit

This has not been proven true yet, but it *feels* like it's the case, yeah.

Reply

[-]

junior600@reddit

Is there a chance to have an abliterated version of the models in the future?

Reply

[-]

Mbando@reddit

My understanding is that the adversarial RHLF and the native FP4 quantization will make it really hard to fix the lobotomy.

Reply

[-]

throwaway2676@reddit

I've seen some people say that as well, but I'm confused why we can't just stick 0s on the end of the weights to dequantize and then finetune like normal. Maybe they've found a local minimum that is just really fucking far away from a lower, non-lobotimized minimum

Reply

[-]

Mbando@reddit

I don't think you could do it that way. If you have a model trained at FP 16, there is like 65K discrete values associated with each weight., But then mixed FP4 there's 16 discrete values (although I think the actual amount of real numbers is slightly smaller for both). There's just enormously greater amount of information for FP 16 to be able to detect the refusal pathways for abliteration.

Reply

[-]

AnOnlineHandle@reddit

AFAIK the biggest benefit of increased precision is just the ability to accumulate gradual small gradient updates during training and allow the more major digits to be incremented or decremented. Stochastic Rounding is one method to emulate this in low precision with a small chance of changing the larger digits based on the direction of the small gradient, so that the more times that occurs the more likely it is to shift, similar to what would happen with accumulation.

Reply

[-]

throwaway2676@reddit

For abliteration, sure, but I'm just talking about re-finetuning on "unsafe" data to reduce refusals. Obviously that requires more compute, but it only takes one organization or group to create a "de-safetied" model and put it on HF

Reply

[-]

Mbando@reddit

I don’t think this works for two reasons. One is that no matter what you call it, alignment, training, construction training, fine-tuning, etc. it’s awesome version of gradient descent to alter the weights. The more you do that, the more you slide towards hallucinations and catastrophic forgetting. Doing more lobotomizing is a hard way to cure lobotomizing. And then in particular, this is FP4, so the coarseness means it would be almost impossible to skillfully fine-tune out the behaviors you want to get rid of. That’s kind of the point of going to such a low precision for training.

Reply

[-]

lakySK@reddit

This! I can run this model at 50 t/s (with little context, speed drops quite fast) on my Macbook. Deepseek and Kimi I would struggle to even download, let alone run. Qwen 235B 35B and GLM4.5 Air are definitely competitors in terms of RAM needed, but it feels like a struggle to fit those into my machine and they are kinda sluggish. So from usage perspective this model seems to fit a different box. So far, I'm actually quite impressed with the speed and how snappy the low reasoning effort mode is. Speaks Slovak significantly better than any open-source model I've recently come across. For someone with 128GB RAM this is quite a solid release. Runs almost as fast as Qwen 3 30B A3B, reasons better and with a lot fewer tokens. I want to test how it codes next, but this result seems actually kinda promising. And I want the model as an assistant, I don't care much about whether it's censored or refuses to answer things about copyrighted content or do ERP with me. So I do think I'll give it some proper testing and see if it sticks.

Reply

[-]

rusty_fans@reddit

That's just plain wrong. Qwen3 32B uses less than a third of the memory of gpt-oss-120b. Are you confusing the dense 32B with the 30BA3B moe ? The A3B is both faster and uses less memory, while the dense 32B would be significantly slower, but also uses way less memory.

Reply

[-]

Few_Painter_5588@reddit

At full accuracy, GPT-OSS is in FP4 and benchmarked accordingly. At full accuracy, Qwen 3 32B is in FP16. If you quantize it to Q4, you will not get the benchmarked performance.

Reply

[-]

PurpleUpbeat2820@reddit

> If you quantize it to Q4, you will not get the benchmarked performance. Q4_K_M is usually only 1-4% worse.

Reply

[-]

rusty_fans@reddit

Yes, but why would you compare only full accuracy ? You can quantize any model to make it more memory efficient. Comparing "full accuracy" to then say the model that's trained at lower precision is superior due to memory usage is just not a useful comparison, when you could trivially optimize the full accuracy version to run at less precision for vastly decreased memory usage if that matters to you.

Reply

[-]

Aldarund@reddit

Thats too bad compared to how they marketed it and what their benchmark shown

Reply

[-]

fdg_avid@reddit

Qwen 3 32B is about to get an update and will go past it. But the real Qwen comparator is 30B-A3B coder, which gets about 52% It’s simply not a good coding model. GLM 4.5 Air is significantly better at a similar size.

Reply

[-]

TheInfiniteUniverse_@reddit

what about Kimi K2? any experience with that?

Reply

[-]

OkraFirm@reddit

Aider has a public leader board. Kimi gets 59%.

Reply

[-]

RawbGun@reddit

It's not a good coding model, not a good general information model (heavily censored) and not a good creative model (heavily censored). What is it even good for?

Reply

[-]

uhuge@reddit

It's fast!-) and also very cheap, on cloud inference.

Reply

[-]

Karyo_Ten@reddit

>What is it even good for? Upcoming 16GB RAM phones

Reply

[-]

InsideYork@reddit

At what?

Reply

[-]

Karyo_Ten@reddit

I think for phones due to low power and battery life constraint only MoE should be considered which leaves Qwen3-30B-A3B and GPT-OSS 20B (3.6B experts). A 30B model at quantization 4 would monopolize all 16GB RAM leaving almost none for context and other app. For now that's the only niche I see OpenAI's model into.

Reply

[-]

cargocultist94@reddit

If they put this into phones, it's going to sour the opinion of billions of people on LLMs.

Reply

[-]

Karyo_Ten@reddit

People are already using full capacity models at work. A disclaimer "connect online for the full experience." shoukd be enough.

Reply

[-]

InsideYork@reddit

But what is it good at? Scientific facts, something Wikipedia is good at.

Reply

[-]

RawbGun@reddit

It's not like it's the only model out of this size

Reply

[-]

Karyo_Ten@reddit

I think for phones due to low power and battery life constraint only MoE should be considered which leaves Qwen3-30B-A3B and GPT-OSS 20B (3.6B experts). A 30B model at quantization 4 would monopolize all 16GB RAM leaving almost none for context and other app. For now that's the only niche I see OpenAI's model into.

Reply

[-]

Neither-Phone-7264@reddit

You also got the small dense models, like qwen3 14b, 8b, 4b, 1.7b, and 0.6b.

Reply

[-]

ortegaalfredo@reddit

Pretty easy to jailbreak it, though.

Reply

[-]

xyzzs@reddit

Proof?

Reply

[-]

ortegaalfredo@reddit

This one works quite well [https://www.reddit.com/r/ChatGPTJailbreak/comments/1mjbn80/gptoss\_jailbreak/](https://www.reddit.com/r/ChatGPTJailbreak/comments/1mjbn80/gptoss_jailbreak/)

Reply

[-]

xyzzs@reddit

"I’m sorry, but I can’t help with that." Didn't work for me at all like most reddit jailbreaks.

Reply

[-]

Particular-Way7271@reddit

At returning everything in a table

Reply

[-]

AngryBear1990@reddit

It's good for being "open". And for the pr of the company probably.

Reply

[-]

Lorian0x7@reddit

I'm not really convinced by these benchmarks. In reality OSS 20b passed my personal coding benchmarks that qwen 30b-a3b coder failed.. (powershell)

Reply

[-]

Sudden-Lingonberry-8@reddit

glm4.5 gets like 30% and glm4.5 air gets like 20%... on aider lmao

Reply

[-]

GhettoClapper@reddit

Glm air is ~64gb in the lowest size I could find on hugging face.

Reply

[-]

UnionCounty22@reddit

Alibaba has all Queen models on their api now. I would look to see their future OS checkpoints to be inferior to cloud checkpoints. Interactive advertisements.

Reply

[-]

boringcynicism@reddit

Qwen3-30B-A3B Coder gets about 33%, not 50+%. It actually regressed compared to to older versions.

Reply

[-]

Dundell@reddit

Yeah all tests I could run were 28\~30%... Maybe the larger version they're referring to he 235B? /benchmarks/2025-08-01-12-44-40--local-llama-full-testv2 \- dirname: 2025-08-01-12-44-40--local-llama-full-testv2 (Qwen 3 30B UD XL Q4 GGUF with 90k Q8 context) test\_cases: 225 model: openai/qwen330b13 edit\_format: diff commit\_hash: f00c1bf-dirty pass\_rate\_1: 13.8 pass\_rate\_2: 28.9 pass\_num\_1: 31 pass\_num\_2: 65 percent\_cases\_well\_formed: 95.6 error\_outputs: 19 num\_malformed\_responses: 19 num\_with\_malformed\_responses: 10 user\_asks: 134 lazy\_comments: 0 syntax\_errors: 0 indentation\_errors: 0 exhausted\_context\_windows: 0 prompt\_tokens: 0 completion\_tokens: 0 test\_timeouts: 6 total\_tests: 225 command: aider --model openai/qwen330b13 date: 2025-08-01 versions: [0.82.3.dev](http://0.82.3.dev) seconds\_per\_case: 158.8 total\_cost: 0.0000 costs: $0.0000/test-case, $0.00 total, $0.00 projected Per Language Pass Rates cpp: 15.4% (4/26) go: 17.9% (7/39) java: 31.9% (15/47) javascript: 34.7% (17/49) python: 38.2% (13/34) rust: 30.0% (9/30)

Reply

[-]

boringcynicism@reddit

235B is closer to 60% and almost twice as large as GLM 4.5 Air so I dunno what they were talking about.

Reply

[-]

101m4n@reddit

What are the numbers for 235B 2507?

Reply

[-]

OkraFirm@reddit

57%, down from 59% from the original version

Reply

[-]

boringcynicism@reddit

Around 57% IIRC

Reply

[-]

SocialDinamo@reddit

It has 5 active parameters, atleast normal people with decent system ram can run it at any acceptable speed. I’m getting 5t/s on dual channel DDR4 3200. I can’t run Kimi or R1 at all

Reply

[-]

i-eat-kittens@reddit

Yep, the arch/size of gpt-oss looks very interesting. It's a shame they lobotomized it so thoroughly that we can't tell how it would perform.

Reply

[-]

SamSlate@reddit

lobotomized how? what metric are you using?

Reply

[-]

RLA_Dev@reddit

It seems a truly scare amount of people are mainly interested in getting revenge from not having had online dating success - so they're looking to finally have someone ask them about their 'throbbing third leg'... Yesterdays posts were all about how it was censoring and not engaging in writing erotica. For people not looking for that they do seem interesting - they're fast, and seem to take well to instructions.

Reply

[-]

SamSlate@reddit

I've wildly underestimated the market for ai girlfriends

Reply

[-]

Tman1677@reddit

You've gotta understand that a solid 50% of this sub just uses their models for smut. Once you understand that all of the discourse makes much more sense

Reply

[-]

Different_Fix_2217@reddit

glm air will run about as fast but is far far far superior at every use case.

Reply

[-]

FullOf_Bad_Ideas@reddit

5B active parameters vs 12B. It's not always a linear scaling, since compute needed sometimes play a role too, but in some scenarios, gpt oss 120b would be almost 2.5x faster than glm 4.5 air.

Reply

[-]

Thick-Specialist-495@reddit

but glm has Multi-Token-Prediction (MTP) too

Reply

[-]

FullOf_Bad_Ideas@reddit

True, some form of speculative decoding could be added onto GPT OSS 120B too though. We could be ping-ponging features for a few messages like that. GPT is a lower quant by default, less actual memory use is needed But GLM has usable exl3 3.07 SOTA quants prepared by turboderp himself, manually tuned for maximum performance. But you might be able to run GPT with W4A8 scheme or maybe even W4A4, exl3 is WxA16. But gpt is mxfp4 and it won't quant to any other size easily Depending on exact place, gpt or GLM will run better. On my setup, GLM 4.5 air 3.07bpw is around 3x faster than gpt 120b gguf mxfp4, just because I can't put the whole gpt in vram. But when I use GLM 4.5 air q4 gguf, it's about the same speed as gpt I think. 2x 3090 ti and 64gb of Ddr4 ram

Reply

[-]

lizerome@reddit

I think the main point here is that we're debating whether gpt-oss is 5% better or 5% worse than comparable Chinese models which came out a month ago. This thing was supposed to beat R1 at 1/6th the parameters in order to blow people away. If it's on par with Qwen/GLM, that's a failure. The whole narrative here is that OpenAI are the OGs, the #1, the king of models. When THEY make something, they do it properly. This is a proper, red-blooded American model that does it right, not that second rate Chinese knockoff crap that tries to imitate it. The only reason the Chinese models are any good is because they train on OpenAI's output and copy all of the innovations THEY came up with. ...Well, if you hype something up for half a year, and then people end up debating whether it is or isn't worse than the Chinese knockoff crap from a month ago, that's not a good look.

Reply

[-]

Thomas-Lore@reddit

I hoped that it would at least be multilingual, but it seems worse than Chinese models at anything other than English. :/

Reply

[-]

ortegaalfredo@reddit

Kimi, deepseek and qwen3 are in another category. Those models need a GPU and a fast one, they don't even run well on macs. GPT-oss can run on a Intel CPU. It's like a big version of Qwen-30B, not a competitor to Deepseek.

Reply

[-]

ANTIVNTIANTI@reddit

you own a mac?

Reply

[-]

ortegaalfredo@reddit

I do, why? I hate it btw.

Reply

[-]

relmny@reddit

? I run qwen3 on my phone and can run on CPU-only mode as well

Reply

[-]

Aggressive-Physics17@reddit

"DeepSeek-R1: 56.9%" refers to the 0120 (20th January) version of R1. Lisan should have mentioned R1 0528 who scores 71.4% in the same benchmark.

Reply

[-]

Gamplato@reddit

Why compare huge models to small ones?

Reply

[-]

Upeksa@reddit

Yeah, you can't compare OSS 120B to Qwen3 32B, it's not fair... Oh wait.

Reply

[-]

Gamplato@reddit

Am I tripping or is Qwen nowhere to be found in the comment I replied to?

Reply

[-]

ThenExtension9196@reddit

Ain’t nobody running deepseek r1 full on a 128G MacBook bro lol

Reply

[-]

MrPecunius@reddit

He is clearly not the Lisan al-Gaib if he left that out. He's the Kwisatz Hatrack at most.

Reply

[-]

Gorgoroth117@reddit

The spice must flow

Reply

[-]

trajo123@reddit

But that's 685B parameters...

Reply

[-]

Orolol@reddit

Yeah I ran it on [FamilyBench](https://github.com/Orolol/familyBench), my own reasonning benchmark that you can't really benchmax because it can be regeneratedn each time, the 120b score below GLM 4.5 air and the 20b, below Hunyuan A13b.

Reply

[-]

LoSboccacc@reddit

that's fantastic, we need more of these randomized benchmark

Reply

[-]

Specialist-Wheel5867@reddit

say that again...

Reply

[-]

LoSboccacc@reddit

This is excellent; we ought to generate additional unpredictable evaluations.

Reply

[-]

vibjelo@reddit

How do you compare the results if you re-generate the questions for each run?

Reply

[-]

Orolol@reddit

It's the same seed for all those results. I'll use another seed later, when I'll retest every models.

Reply

[-]

Leopold_Boom@reddit

This is great (though the danger is that if they cared, model creators can train their model on your problem with random seeds and gain performance relatively easily). I like how you've done this though. I firmly believe that benchmark creators should generate 25-50% more questions and release \~5% of the questions every 6 months. Will significantly help detect benchmark gaming.

Reply

[-]

Orolol@reddit

> This is great (though the danger is that if they cared, model creators can train their model on your problem with random seeds and gain performance relatively easily). Of course, but the point is that it's quite immune to direct data contamination. If they train on it and their models become more performant because of it, great ! If they're just benchmaxxing, I'm working on more benchmarks anyway.

Reply

[-]

HiddenoO@reddit

>If they train on it and their models become more performant because of it, great ! More performant **on this specific task**. The whole idea of benchmaxing is that you overtrain (and thus overfit) on tasks that are part of benchmarks.

Reply

[-]

Orolol@reddit

> More performant **on this specific task**. Yeah of course. > The whole idea of benchmaxing is that you overtrain (and thus overfit) on tasks that are part of benchmarks. But with fixed question benchmark, it's quite easy to have data spilling, but overtraining a model to answer MMLU for example, even with rewards and without giving the answer directly, the model won't be good answering questions, it will be good answering those questions. With randomly generated questions, you force the model to generalize in this area of skill. For example in my benchmark, a big chunk of complexity come from retrieving information in a large context. In the current seed, there's 400 different people described in a 20k token context. When I ask a model to give all the cousins of the father of the sister of X, I make the model looking for many needles in a large haystack. Sure, after overtraining on this, models will be better on this specific benchmark, but it would still benefits far more for the global performance of the model rather than a fixed set of questions where the model just have to guess and memorize answer.

Reply

[-]

HiddenoO@reddit

You're arguing about why randomized questions are better than fixed questions, but I never questioned that claim. I specifically questioned the way you're presenting it here as if randomized questions (which still follow a specific pattern) meant that you "can't really benchmax", and that "training on \[them\]" would necessarily make the models "more performant \[in general\]".

Reply

[-]

Orolol@reddit

> There's a massive difference between mitigating and solving a problem, and you're acting as if randomized questions in a benchmark solve these problems, when in reality, they mitigate them to a certain degree, but you absolutely still can benchmax on a benchmark with randomized questions. Ok I think we can agree on this.

Reply

[-]

BrainOnLoan@reddit

What's the variance in results when going to different seeds? How stable is the benchmarking?

Reply

[-]

LoSboccacc@reddit

there are a set of other options, like pair wise scoring and binary placement, or a elo system

Reply

[-]

joe0185@reddit

Yeah that makes sense. Benchmax is definitely happening. Contrary to popular belief, they don't have to train on the data from the tests to benchmax. Just selecting the model to release based upon how it performs on a small set of popular benches can implicitly overfit the model via selection. Then you you'll see regressions in other areas that were not tested for.

Reply

[-]

HiddenoO@reddit

The same can happen with this benchmark as well. Nowadays, these models are so capable that you're often not overfitting to the individual samples in the benchmar, but to the specific type of task.

Reply

[-]

HiddenoO@reddit

>you can't really benchmax because it can be regeneratedn each time You absolutely can. Benchmaxing doesn't necessarily mean overfitting to individual samples, it can also mean overfitting to specific sample classes (such as types of tasks). In that case, the scores will be representative for that model on your specific type of benchmark tasks (reasoning about family trees), but that may not generalize to any other tasks that would be considered just as "difficult" or require "similar reasoning" so to say. Your benchmark's main benefit, as of now, is that it hasn't blown up and is likely not on the radar of these companies (although that's not for certain either).

Reply

[-]

SuperFail5187@reddit

Cool benchmark. No QwQ though. I guess because it was roughly as Qwen 3.2 thinking?

Reply

[-]

EmberElement@reddit

You've probably seen it but other folk here may not have: Apple have released relevant research about the same thing: * https://machinelearning.apple.com/research/illusion-of-thinking * https://arxiv.org/pdf/2410.05229

Reply

[-]

Orolol@reddit

Yep, wonderful paper !

Reply

[-]

alphabetaglamma@reddit

What does random generation prevent benchmaxxing?

Reply

[-]

Orolol@reddit

You can't train on the questions.

Reply

[-]

trajo123@reddit

...you can, and it would still help improve performance even if it hasn't seen the exact same question.

Reply

[-]

Orolol@reddit

The whole tree is entirely new each time, not only the question. Sure, training would improve performance, but this is literally how LLMs works, they get better when training.

Reply

[-]

trajo123@reddit

But it's still benchmaxing. Training on a set of benchmark problems (even if that set is nearly infinite) is still benchmaxing.

Reply

[-]

bbsss@reddit

if it gets better at a nearly infinite set of problems, and you recognize that it generalizes, how is it benchmaxing exactly?

Reply

[-]

trajo123@reddit

We are talking about an infinite set of family tree problems, no? So by training on this set, it learns how to solve family tree problems in general, not just the ones it saw. But that doesn't mean that it's good at other things. Consider the extreme case, where you train an LLM only on your benchmark, nothing else. It will get quite good at it, but will fail all other benchmarks and have no real world utility. In other words, it benchmaxed your benchmark.

Reply

[-]

Orolol@reddit

> Training on a set of benchmark problems (even if that set is nearly infinite) is still benchmaxing. Non, benchmaxxing is training to benchmark to a point that your model can't generalize and is far less potent for users than for benchmark. If you train a model to be good on benchmark, but your model can still generalize and have better performance after this training, then there's no problem. This is why randomly generated benchmark are great, they test the ability for a model to generalize on a specific area rather than brute learning solutions.

Reply

[-]

InsideYork@reddit

How?

Reply

[-]

trajo123@reddit

Deep learning models generalise to some extent, they don't just memorise the training set. In this case it will learn to reason about family tree problems. Through training it builds an approximate algorithm to solve such problems.

Reply

[-]

gofiend@reddit

Can I just say FamilyBench is really clever! Have you considered using it to really stress test long context lengths (200K+)? Ideally you’d intermix statements about these people but not family tree oriented to extend the text (and stress test attention)

Reply

[-]

Orolol@reddit

Thanks ! I'll do more tests with long context, more thinking tokens, etc, but this is quite expensive haha. First I need to test Opus and o3 to see how sota models perform.

Reply

[-]

Leopold_Boom@reddit

Do you send the context with each question in your bench or do you chain questions in multi-turn? I'm happy to run some benchmarks also and contribute (esp on opensource models that support long context). Been meaning to really stress test quantization and cache quantization and this is a very good benchmark for it.

Reply

[-]

EstarriolOfTheEast@reddit

There's a thinking version of Qwen 3 30B A3B, it's worth adding that to your benchmark to get a clearer picture. GPT-OSS 20B's score on your benchmark is actually pretty good all considered. Also, is Qwen 3.2 Thinking QwQ? And what size is the model listed as Qwen 3.2?

Reply

[-]

lemon07r@reddit

what model is qwen 3.2?

Reply

[-]

jnk_str@reddit

Its really not good in comparison. Weird to see all the answers on Samas X post about the models. People are speaking of the new best model, huge milestone etc. Wonder whats going on in their heads, don't they test models? Or do they just not realize? Like what is this?! [https://x.com/measure\_plan/status/1952796264359407796](https://x.com/measure_plan/status/1952796264359407796)

Reply

[-]

Iory1998@reddit

>GPT-OSS looks more like a publicity stunt as more independent test results come out :( Do you have any doubts? What were you expecting? Another Deepseek-R1 or Qwen\_QwQ-32B moment? That's not gonna happen from the American labs anymore.

Reply

[-]

Thomas-Lore@reddit

I expected it to at least be multilingual.

Reply

[-]

Monkey_1505@reddit

Safety tuning reduces intelligence IMO.

Reply

[-]

Thomas-Lore@reddit

Even if it doesn't, the model wastes thinking tokens on considering its hardcoded policy instead of thinking about correct answer.

Reply

[-]

cobbleplox@reddit

Rather hard to compare anything here. When a 120B model has like 5B active parameters, I am tempted to rather compare it to other 5B models than to other 120B models.

Reply

[-]

Thomas-Lore@reddit

Compare it to models around 25B using the geometric mean rule. And 20B is around 8B using that method.

Reply

[-]

eldercito@reddit

I saw everyone get excited about it.. fell on its face for my agent use case.

Reply

[-]

pitchblackfriday@reddit

> I saw everyone get excited about it Who? Most people here were very skeptical about this PR stunt from the beginning, even before the "AI safety" comment. Remember the Twitter poll where he was trying to release a small language model that runs on a smartphone? If anyone was having a high expectation, it's their fault.

Reply

[-]

k4ch0w@reddit

I was excited. It looked promising and there was hype around it. I poked at it as Horizon Alpha and it looked amazing at first. Now that I've played with it, I've been nothing but disappointed and believe it's a waste of disk space compared to GLM/Kimi. America is losing it's edge in tech, it's actually crazy to watch it happen.

Reply

[-]

Thomas-Lore@reddit

Horizon seems to write so well at first, until you look closer at the sentences. It makes so many small logic errors, reminds me of early Gemma. Maybe the thinking version will be more reasonable, hope it is not gpt-5.

Reply

[-]

eldercito@reddit

I mean people looking at the benchmarks before using it are talking about it like it is a game changer. Youtube etc. I have found the benchmarks to be pretty pointless now... drop it into a coder or your own use case and see what happens. for me gemini-2.0-flash and gpt-4o or 4.1 win for conversational / lower latency chat

Reply

[-]

Maleficent_Age1577@reddit

Openai paid marketers. I didnt even feel slightest disappointment as i knew what was coming if something was coming from openai. Go China, f America.

Reply

[-]

createthiscom@reddit

Yeah, it ends the convo instantly in open hands. R1-0528 ends convos too though. I think Open Hands just has trouble with reasoning models, unfortunately. They really need to fix that.

Reply

[-]

FrostAutomaton@reddit

OpenAI's blog post does state that its training data is "mostly English". That's one potential explanation for why it fails a polyglot benchmark. Though granted, a mostly English (or mostly English and Chinese) dataset is the standard for a majority of LLMs. Llama3 had about 8% multilingual data, for example.

Reply

[-]

CarobFull3130@reddit

The model defaults to low effort. I ran gpt-oss on aider polyglot with "Reasoning: high\\n" prepended to the system message and got 59.1% for the 120b and 28.9% for the 20b.

Reply

[-]

RMCPhoto@reddit

I think we need to manage expectations and see the real use case. Unless you built an AI rig. This is probably the best model you can run on your computer. It runs fine on CPU. ( Cerebras is serving it at something like 3k tps. ) It's very sensible and allows for integration into software consumers can actually use.

Reply

[-]

sourpatchgrownadults@reddit

Agreed on managing expectations. I don't think GPT OSS was intended for use cases outside of English. Sam Altman / OpenAI clearly said it was trained on mostly English-only text. Well duh, OF COURSE it'll score poorly on a POLYGLOT benchmark.

Reply

[-]

mikael110@reddit

To be honest that is not of my greatest disappointments when it comes to GPT-OSS, I had hoped this would become one of the best, if not absolute best multilingual OSS models. As OpenAI clearly has access to a waste amount of multilingual data, and their bigger models are some of the best at a wide variety of languages. Training it mostly on English only feels like a really odd decision. Especially given most other popular models of that size is at least bilingual these days.

Reply

[-]

SamSlate@reddit

is there a reason you can't just use a translator MCP? you're asking for a ton of overhead that the overwhelming majority of users don't need.

Reply

[-]

ivxk@reddit

Polyglot in this context refers to multiple different _programming languages._ It's right there when you Google it, and can be easily inferred by the context of it being a coding benchmark. The post is saying that it is worse than other models at programming tasks, the benchmark is in English.

Reply

[-]

sourpatchgrownadults@reddit

Oops nvm then, that's my bad lol

Reply

[-]

thebadslime@reddit

The 20B is worse than Qwen A3B, and MUCH worse than ERNIE 3.4 21BA3B. It being American is the only good thing about it, it is not a good model.

Reply

[-]

RMCPhoto@reddit

I'm just playing devil's advocate here. But, the way they approached this, and the "safety" etc. Will allow large corporations to adopt local models where previously there would be too much liability. IE they aren't going to run Qwen A3B in mail trucks.

Reply

[-]

jakegh@reddit

Qwen3-coder, GLM4.5-air, and Kimi K2 all honestly embarrass GPT-OSS, IMO. It isn't a *bad* model, but the recent Chinese ones are simply superior. Only real advantage of GPT-OSS is the 20B version will run on consumer GPUs with 16GB VRAM.

Reply

[-]

Sea_Fox_9920@reddit

I don't understand why everyone likes GLM4.5-air so much. It has the same size as GPT-OSS only in iq4_xs vs q8 GPT-OSS (unsloth). It has a lower token generation speed: 20 t/s vs 30 t/s (5090 + 64gb + 14700k). It shows worse in my own tests (but to be fair GPT-OSS sometimes generates really weird results). So I don't get it at all. It's all about the 120b version. The 20b version is complete garbage, it is so strong in math by benchmarks, but in reality it pretty constantly thinks that 15.11 > 15.9 for example. The real king here is qwen 3 30b thinking 2507. 50k context, 120-150 t/s in q6 unsloth, not that censored and faster loading. It's soo good. Only in math problems it is rarely worse than 120b, but the pros outweigh this con.

Reply

[-]

Informal-Spinach-345@reddit

GLM 4.5 Air starts off great but shits itself pretty bad up into the halfway mark of context. It's overly aggressive with tool calls. The GPT-OSS model needs time for the ecosystem to catch up, some fixes to chat templates, etc. What I've noticed with GPT-OSS is that while not as flashy or fancy as the chinese models on one shot games/apps, they seem to be more functionally sound with less prompting. Time will tell.

Reply

[-]

jakegh@reddit

GLM I like mostly because it seems to never, ever, mess up tool calls. Qwen I agree is better overall.

Reply

[-]

Expensive-Apricot-25@reddit

This is not a fair comparison, Komi k2 is 1 TRILLION parameters… deepseek is 671b, and qwen3 32b is a dense model, where as the gpt-oss is a very sparse 5b active moe model.

Reply

[-]

createthiscom@reddit

It’s a fair comparison if all you care about is capability. I can run all of those models.

Reply

[-]

Expensive-Apricot-25@reddit

It doesn't matter what you think. Smaller models have applications that larger models don't.

Reply

[-]

Far_Buyer_7281@reddit

what is this with you guys? what where you suspecting to happen?

Reply

[-]

tarruda@reddit

GPT-OSS is very strong in my tests. Note that bugs in inference engines and chat templates can greatly lower the perceived performance of the LLM, so I would give it some time.

Reply

[-]

Affectionate_Relief6@reddit

How about hallucinations?

Reply

[-]

tarruda@reddit

Yes it does seem to hallucinate more easily in larger contexts

Reply

[-]

mikael110@reddit

Yeah I often notice that when new models come out with a vastly different way of prompting it, or an unusual tokenizer or anything else like that it often gets shat on during the first week or so before the pain points are ironed out and people release it's actually a pretty decent model. I know Gemma 2 certainly went through some growing pains like that. GPT-OSS's tokenizer is quite standard but it has a very unusual prompting template and way to output content. That's why OpenAI release [Harmony](https://github.com/openai/harmony) as a reference project. It's clear most programs aren't really setup to handle it ideally yet.

Reply

[-]

inteblio@reddit

I am also wondering if people are 'running it wrong'. I was very impressed. Very fast, very strong. Delighted to be living in the future. In 2020 a 12gb GPU could generate maybe a line or two of 'continuation' text. Now this stuff. incredible.

Reply

[-]

tarruda@reddit

Also, personal benchmarks are biased and people assume the model is bad when it fails to one shot example programs. My only criticism of GPT-OSS is that it seems to forget things very easily. I lost a lot of detail when I asked it to summarize a conversation of 26k tokens, while other models did much better (though this too may be a bug in the inference method I'm using, we'll see).

Reply

[-]

Fun-Wolf-2007@reddit

It is a publicity stunt, they need to ensure people forget about the news that OpenAI development team were using Claude to develop GPT5, so they lost access to Claude So when GPT5 will not deliver what they promised OpenAI will use GPT-OSS as a comparison between them OpenAI just lowered the bar

Reply

[-]

DrummerPrevious@reddit

I think it’s just over simplified version of horizon because their open source model became too good that they cannot made it open as whole

Reply

[-]

evilbarron2@reddit

In my setup, the gemma3:27b variant I use absolutely kicks gpt-oss’ ass. Not even close. Is this a one-off or a sign of a bigger issue?

Reply

[-]

Ok-Telephone7490@reddit

I hope this isn't a sign that they've lobotomized GPT-5 into useless, boringness. If they did, they can kiss my 200 pro account goodbye.

Reply

[-]

caledh@reddit

Indeed. Was really bad with RooCode once I got it configured through Azure AI Foundry

Reply

[-]

entsnack@reddit

I think this is impressive! So I can get Qwen3 32B performance, which is my favorite model family for English, with just 5.1B active parameters and blazing fast inference?

Reply

[-]

idkwhattochoo@reddit

OP said 120B version; I don't think it is impressive at all

Reply

[-]

entsnack@reddit

Yeah 120B has just 5.1B active parameters.

Reply

[-]

SixZer0@reddit

Groq hosted gpt-oss-120B quality is not good that is my experience! and they tested with openrouter which can randomly serve the groq hosted version.

Reply

[-]

eli_pizza@reddit

how is it not the same model?

Reply

[-]

Bangaladore@reddit

Inference implementation differences can vastly vary perceived model quality. Bugs in the implementation might produce something that looks correct but is "dumber" overall.

Reply

[-]

Whole-Assignment6240@reddit

the initial buzz had a lot of promise, but the growing gap between the hype and independent benchmarks is hard to ignore

Reply

[-]

mgr2019x@reddit

Now we know which benchmarks are useless. All these in which gpt-oss is competitive. I can work with that.

Reply

[-]

Spirited_Example_341@reddit

at this point i bet llama 3 8b stheno is better!

Reply

[-]

No_Contact_9561@reddit

1 guy

Reply

[-]

Dantescape@reddit

The tool use is great. I've managed to setup MCP stuff between GitHub and Notion without issues

Reply

[-]

damiangorlami@reddit

This model wastes so many tokens and computation on censorship.. it's insane! Yesterday I did around 30 messages with the model and I kid you not, almost 30% of the thinking tokens were about censorship. What A HUGE waste of electricity and computational resources to be overthinking so much on censorship. Even a simple ask "choose between these two football clubs" and its censorship about how it cannot side with debates creeps up and wastes thinking tokens. Straight to the 🗑️

Reply

[-]

TheRealGentlefox@reddit

Kimi and R1 are like 6x the size of OSS. According to square root law, it's about 25B which is less than the 32B it loses to.

Reply

[-]

BoJackHorseMan53@reddit

Everything Saltman says is publicity stunt

Reply

[-]

lemon07r@reddit

Anyone got aider polyglot results for the new glm and qwen models?

Reply

[-]

Low88M@reddit

Probably oss120b « gift » is a campaign to clean their closed identity to the IA open source dev community. And openAI was really well supported by LMStudio and Ollama etc with this campaign. Much more than open-source (or open weights ?) GLM4.5 Air which is probably much better for coding and can be run with less specs. Strange behavior !

Reply

[-]

Michael0308@reddit

Somehow the 13GB model swells to 31GB in openwebui and offloaded to CPU as well. Token generation is dismal

Reply

[-]

ryunuck@reddit

This model is straight garbage. Immediately on the first test I did it failed catastrophically. Take a look at this https://i.imgur.com/98Htx6w.png I referenced a full code file, asked it to implement a simple feature but I made a mistake and specified LoggerExt instead of EnhancedLogger. (I forgot the real name of class) But there was no ambiguity, only class in context and VERY clearly what was meant based on the context I provided. So I rectify that, update with the right class, and what does it do next? Starts using search tools and wasting tokens. The class is in the context. Kilo did nothing wrong, I retried with Horizon Beta, same exact prompt. Immediately understood what I meant, immediately gets to work writing code.

Reply

[-]

popecostea@reddit

I am curious though what reasoning effort they are using. I am not sure how I can set the reasoning effort when using llama.cpp, since its defined in the chat template and if its not specified it defaults to medium. I've heard that the model behaves pretty well on high reasoning effort only.

Reply

[-]

chibop1@reddit

The reasoning level can be set in the system prompts, e.g., "Reasoning: high". https://huggingface.co/openai/gpt-oss-120b

Reply

[-]

popecostea@reddit

In the chat template, in the system prompt building macro, you can find `{%- if reasoning_effort is not defined %}` `{%- set reasoning_effort = "medium" %}` `{%- endif %}` `{{- "Reasoning: " + reasoning_effort + "` that's where my confusion comes from. Is the reasoning\_effort kwarg taken from the "user-provided" system prompt, or is this building macro not used if you use a custom system prompt?

Reply

[-]

popecostea@reddit

If people downvote this because it is a stupid question, perhaps it would be useful to explain why it’s like that, as this matter is not intuitive.

Reply

[-]

popecostea@reddit

In the chat template, in the system prompt building macro, you can find || || |{%- if reasoning\_effort is not defined %}| || || |{%- set reasoning\_effort = "medium" %}| || || |{%- endif %}| |||

Reply

[-]

BillyWillyNillyTimmy@reddit

The picture says "gpt-oss-120b (high)", therefore I assume that it used high reasoning effort.

Reply

[-]

popecostea@reddit

yeah, my bad, I missed that part

Reply

[-]

marcoc2@reddit

There is no other way. Chinese models beat in almost every field now. Video, Image and LLMs, at least.

Reply

[-]

skrshawk@reddit

This model really feels like a troll job - create all kinds of hype around it and then release a model that shows just enough of what might be possible in terms of speed but make it unusable for any reason someone would want to use a local model. It wouldn't surprise me if they turn around and use this failure as a ploy to lobby for more government resources to "compete" with Chinese models when the real problem was they just dropped a deuce on us all.

Reply

[-]

SmartEntertainer6229@reddit

It’s Sam’s trojan horse joke on the locals!

Reply

[-]

Sadman782@reddit

My take: This model is closer to o3 mini than o4 mini (it has less knowledge overall, is more censored, and has no multimodality). o4 mini is also not good for web dev, especially if you need an aesthetically good-looking website. Also, keep in mind this model is comparable to a ~25B dense model (sqrt(120*5.1) = 24.78B), but we shouldn't forget only 5.1B of that is active. But it's very, very efficient + thinks lesser than other open models. You can run it easily with just a CPU and DDR5 RAM. Another thing I've noticed is that the Firework versions perform much better than the Groq ones. This makes me more grateful to the Qwen team, though. It's like when you're given something, you don't value it that much. I don't use o4 mini often, but I used it today to compare with these OSS models, and I think Qwen-3-30B-A3B performs comparably to o4 mini.

Reply

[-]

Utoko@reddit

It is a very strange model, I tested some knowledge question and even the 120B model is very limit in certain aspects. Someone on Twitter said it was only trained on syntactic data, which might explain some of it. It performs mathematical calculations and certain types of coding very well. However, the initial hype that it is basically an O3 at home seems to be not true at all. Imho overhyped at day one but not bad for the right use case.

Reply

[-]

gigaflops_@reddit

Why would anyone make a twitter post using a *single* benchmark score and extrapolate it to the overall usefulness of the whole model? Plus, if DeepSeek-R1 used in this comparison is the 671b unquantized version, that's in an entirely different league and it'd be a miracle if it *didn't* blow away the 120b MoE that runs on consumer-grade hardware.

Reply

[-]

boringcynicism@reddit

It's actually the old DeepSeek, the new one gets 70+% even when quantized. It is still SOTA for open weights here I think.

Reply

[-]

mvp525@reddit (OP)

OpenAI said GPT-OSS is the worlds best open source model claiming sota performance on benchmarks. but it perfomed worse on independent benchmarks like simplebench, Aider Polyglot or [Artificial analysis](https://x.com/ArtificialAnlys/status/1952887733803991070) and i never claimed GPT-OSS is a bad model, it is def a top 5 open weight model

Reply

[-]

loyalekoinu88@reddit

Didn’t they clarify that with “on a single gpu”?

Reply

[-]

lordchickenburger@reddit

what do you expect from sam altman. he is known to just want to please everyone and manipulate the narrative

Reply

[-]

boringcynicism@reddit

He manipulated the narrative by...announcing almost identical results?

Reply

[-]

Different_Fix_2217@reddit

Finally independent benchmarks to prove Openai was lying on their own.

Reply

[-]

boringcynicism@reddit

OpenAI literally announced similar scores...

Reply

[-]

EngStudTA@reddit

Aider was in the model card they released as being 24% for low, 34% for medium, and 44% for high. Given on other model's like gemini 2.5 pro I've seen it get between like 78% and 86% a 2% difference seems quite reasonable. So I don't really see this independent test as disagreeing with the results they released at all.

Reply

[-]

boringcynicism@reddit

Yep, this is entirely in line with what they claimed 🤷

Reply

[-]

lily_34@reddit

None of the listed "for comparison" models actually compare in terms of size or active parameters, though. GLM-4.5 air, or maybe Qwen3-235B, quantized to 2-bit, would be the most fair (though they have more active)...

Reply

[-]

boringcynicism@reddit

235B has 30B active, pretty fat compared to current fashion.

Reply

[-]

Leflakk@reddit

None of them has the same size…

Reply

[-]

mvp525@reddit (OP)

really? no way! openAI claimed GPT-OSS is the best open source model while performing worse on indipendent benchmarks, that is what my post criticising and yes it is a good model def top 5 rn

Reply

[-]

Leflakk@reddit

They mostly claimed an o3 mini level model

Reply