Any tips/advice for running gpt-oss-120b locally
Posted by gamesntech@reddit | LocalLLaMA
I have an RTX 4080 (16 GB VRAM) with 64 GB RAM. I primarily use llama.cpp. I usually stay away from running larger models that don't fit within the GPU (I use Q4_K_M versions) because they're just too slow for my taste (I also don't like my CPU spinning all the time). Since the 120b definitely does not fit on my GPU, I want to at least test it with offloading. It seems there are specific flags and layer specifications that are more useful in this scenario, so I'd greatly appreciate it if anyone could share the options that worked reasonably well for them with 16 GB of VRAM.
QbitKrish@reddit
Just run --n-cpu-moe 36 to offload all the expert layers to the CPU and --n-gpu-layers 999 to offload everything else to the GPU, along with whatever context size you're looking for and flash attention. If you have some VRAM headroom (which you should; this takes less than 8 GB of VRAM for me), you can reduce the number of layers offloaded by --n-cpu-moe to increase tokens per second (e.g. --n-cpu-moe 28) until your VRAM is full. I have a much weaker setup (3060 Ti with 8 GB of VRAM), but I was able to get a usable 7-10 tokens per second with 64 GB of DDR4 RAM, so you should be able to achieve a pretty healthy speed.
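A minimal sketch of what that invocation could look like with llama-server, assuming a llama.cpp build recent enough to support gpt-oss and the --n-cpu-moe flag; the model filename is a placeholder and the context size is just an example, so adjust both (and the --n-cpu-moe count) to your own setup:

    # Placeholder GGUF path; point -m at your actual gpt-oss-120b file.
    # --n-gpu-layers 999 : put everything the GPU can take on the GPU
    # --n-cpu-moe 36     : keep the MoE expert weights of all 36 layers in CPU RAM
    # -c 16384           : example context size, adjust to taste
    # -fa                : enable flash attention
    llama-server -m gpt-oss-120b-mxfp4.gguf --n-gpu-layers 999 --n-cpu-moe 36 -c 16384 -fa

Lowering --n-cpu-moe moves more expert weights back onto the GPU, so decrease it step by step until VRAM is nearly full for the best speed.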
gamesntech@reddit (OP)
This is very helpful, thank you! I was mostly using the Q4 versions because they save additional memory, but I will try the f16 version as well.
chisleu@reddit
Please post up your benchmarks. Just ask it to write a paper on the negative effects of gooning or whatever.
I get about 60 tok/sec on an M4 Max MacBook Pro. I'm using the official 4-bit openai/gpt-oss-120b release.
rbgo404@reddit
Check this template for the 20b model:
https://docs.inferless.com/how-to-guides/deploy-openai-gpt-oss-20b