Any tips/advice for running gpt-oss-120b locally

Posted by gamesntech@reddit | LocalLLaMA

I have an RTX 4080 (16 GB VRAM) with 64 GB of RAM, and I primarily use llama.cpp. I usually stay away from larger models that don't fit entirely in the GPU (I run Q4_K_M quants) because they're just too slow for my taste, and I don't like my CPU spinning the whole time. Since the 120b definitely doesn't fit on my GPU, I want to at least test it with CPU offloading. It seems there are specific flags and layer/tensor-offload settings that help in this scenario, so I'd greatly appreciate it if anyone could share the options that worked reasonably well for them with 16 GB of VRAM.
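For reference, this is roughly the starting point I had in mind, pieced together from llama.cpp's offloading options. The model filename, context size, and thread count are placeholders for my setup, and the tensor-override pattern for pushing the MoE expert weights to system RAM is an educated guess on my part rather than something I've verified:

```
# Rough starting point -- not verified on my machine.
# Offload all layers to the GPU with -ngl, then override the large MoE
# expert tensors back to CPU RAM so attention/dense weights and the
# KV cache stay in the 16 GB of VRAM.
./llama-server \
  -m ./models/gpt-oss-120b.gguf \
  -c 8192 \
  -t 12 \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU"
```

I've also seen newer builds mention --cpu-moe / --n-cpu-moe convenience flags that are supposed to do the same expert offload without the regex, but I haven't confirmed the exact syntax. Does this kind of split look right for 16 GB, or is there a better combination?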