OpenAI-GPT-OSS-120B scores on livecodebench

Posted by Used-Negotiation-741@reddit | LocalLLaMA | View on Reddit | 6 comments

Has anyone tested it？Recently I locally deployed the 120b model but found that the score is really low(about 60 on v6),and I also found that the reasoning: medium setting is better than reasoning: high, it is wired.（the official scores of it have not been released yet).
So next I check the results on artificialanalysis(plus the results on kaggle), and it shows 87.8 on high setting and 70.1 on low setting, I reproduce it with the livecodebench-prompt on artificialanalysis ,and get 69 on medium setting, 61 on high setting, 60 on low setting(315 questions of livecodebench v5,pass@1 of 3 rollout，Fully aligned with the artificialanalysis settings)
Can anyone explain?the tempeture is 0.6, top-p is 1.0, top-k is 40, max_model_len is 128k.(using the vllm-0.11.0 official docker image)
I've seen many reviews saying this model's coding ability isn't very strong and it has severe hallucinations. Is this related?

[-]

egomarker@reddit

Which model file exactly do you use.

Signal_Ad657@reddit

I test it by telling GPT-5.1 it has to grade an unknown model and make a variety of prompts to test it. Then at the end has to guess the model. It always scores really well, and it usually guesses that it’s talking to GPT-4o or Claude Sonnet 3.5

AXYZE8@reddit

You are not using recommended settings.

https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune#running-gpt-oss

Used-Negotiation-741@reddit (OP)

okay,actually I've tried many temperature settings(from 0.6 - 1.2) and top-k settings(40, -1), I got similar results(all within 3 points.), the reason why I chose temperature=0.6 was that I want to align the result from artificialanalysis.
But I indeed didn't try top-k=0 before, I'll give it a try now,thanks for sharing this website

Aggressive-Bother470@reddit

Two things.

It looks like you're using qwen params for gpt.

I've observed but not measured slightly subpar outputs in vllm when using 'high' vs lcpp.

ravage382@reddit

Coding is actually decent for python. Agree you need to adjust your settings