Has anyone been able to run gpt-oss 20b on a 5090?

Posted by celsowm@reddit | LocalLLaMA | 18 comments

Hi! I tried the new vLLM docker image, but I got the error "Sinks are only supported in FlashAttention 3". Any hints?

18 Comments

anzzax@reddit

Works fine in the updated LM Studio with openai/gpt-oss-20b: 205.70 tok/sec on a simple prompt.

Great_Guidance_8448@reddit

I am getting around 90 tok/sec when generating code and 330 tok/sec when summarizing text on my laptop 5090 (24 GB).

Chance-Studio-8242@reddit

I am curious how the thermals are when running LLMs on a laptop 5090.

General-Cookie6794@reddit

Lol

Chance-Studio-8242@reddit

Any luck here with the 5090 and vLLM?

celsowm@reddit (OP)

Yes, now it's working very well.

Great_Guidance_8448@reddit

Works fine with the latest LM Studio on my laptop's RTX 5090 (24 GB).

Sorry_Ad191@reddit

That's the llama.cpp backend. We are trying to get it to work in vLLM or SGLang because the throughput is 10x compared to llama.cpp.

mxforest@reddit

How much context are you able to fit on a single 5090? I mean total context across multiple users?
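
A rough way to reason about this question is to estimate the KV cache size per token and divide the VRAM left after loading the weights by it. The sketch below assumes plain full attention on every layer and a 2-byte cache dtype; the layer count, KV head count, head dimension, and 16 GiB budget are illustrative placeholders, not confirmed figures for gpt-oss-20b or for any particular serving stack. Models that use sliding-window attention on some layers would need less cache than this estimate.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_vram_bytes, per_token_bytes):
    """Total tokens (summed over all concurrent users) that fit in the budget."""
    return free_vram_bytes // per_token_bytes


# Placeholder values for illustration only (assumptions, not measured).
per_token = kv_cache_bytes_per_token(num_layers=24, num_kv_heads=8, head_dim=64)
budget = 16 * 1024**3  # hypothetical VRAM left for the cache after weights
total_tokens = max_cached_tokens(budget, per_token)
print(f"{per_token} bytes/token, about {total_tokens} tokens in total")
```

The total is shared across users, so dividing it by the number of concurrent sessions gives a per-user context estimate. Real servers also reserve memory for activations and fragmentation, so the usable figure will be lower.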

Sorry_Ad191@reddit

Any news, other than llama.cpp working? I mean for vLLM or SGLang?

celsowm@reddit (OP)

No :( Only llama.cpp and Ollama so far. The problem is Triton and FlashAttention.

Sorry_Ad191@reddit

Someone somewhere must have figured it out :-) If not, soon(TM).

sleepingsysadmin@reddit

In my testing on LM Studio, you can't have flash attention enabled. This model uses a different kind of attention mechanism. The LM Studio blog explains it.

Green-Ad-3964@reddit

I get an error with Ollama (5090 here).

ForsookComparison@reddit

The llama.cpp update to run gpt-oss models was only merged into Ollama 3 hours ago. If you haven't updated since then, you'll need to do so.

Green-Ad-3964@reddit

I updated before testing, yet I still get the error: https://preview.redd.it/mzkq0j131ahf1.png?width=1204&format=png&auto=webp&s=84ad904e61c218d9b4f37c7aafbee958ca4fd93e

ForsookComparison@reddit

Download llama.cpp so you can see in the foreground what's actually failing.

Prestigious_Thing797@reddit

Failing on my Pro 6000 too. Seems Blackwell support isn't working like they announced :( Especially frustrating since the model was designed to leverage the FP4 compute that Blackwell has. [https://blog.vllm.ai/2025/08/05/gpt-oss.html](https://blog.vllm.ai/2025/08/05/gpt-oss.html)