New Local LLM Rig: Ryzen 9700X + Radeon R9700. Getting ~120 tok/s! What models fit best?
Posted by jsorres@reddit | LocalLLaMA | View on Reddit | 11 comments
Hi! I just finished building a workstation specifically for local inference and wanted to get your thoughts on the setup and on model recommendations.
•GPU: AMD Radeon AI PRO R9700 (32GB GDDR6 VRAM)
•CPU: AMD Ryzen 7 9700X
•RAM: 64GB DDR5
•OS: Fedora Workstation
•Software: LM Studio (Vulkan backend); planning to try llama.cpp next
•Performance: Currently hitting a steady ~120 tok/s on simple prompts (qwen3.6-35b-a3b)
What is the largest model architecture you recommend running comfortably? Should I be focusing on Q4_K_M quantizations?
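Not from the thread, just napkin math for the "what fits in 32 GB" question: GGUF file size is roughly parameter count times average bits per weight. The bits-per-weight figures below are approximations (real file sizes depend on the per-tensor quant mix):

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough GGUF weight size in GB: billions of params x bits per weight / 8."""
    return params_b * bits_per_weight / 8

# Approximate average bits/weight for common GGUF quants (ballpark only):
for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.5), ("Q8_0", 8.5)]:
    print(f"30B @ {name}: ~{weight_gb(30, bpw):.1f} GB")
```

Anything whose weight estimate lands well under the 32 GB of VRAM leaves room for KV cache and runtime overhead; anything near or above it will spill to system RAM and slow down hard.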
putrasherni@reddit
qwen 3.5 35B Q5_K_XL
jsorres@reddit (OP)
Thx, that's interesting ☑️
jsorres@reddit (OP)
Context window size, 32k ?
putrasherni@reddit
you can go all the way to the full 262144 context with qwen 3.6 35B Q4,
and 131072 with Q5.
Make sure to set the correct flags as per the qwen huggingface page for your use case.
you should get TG4096 around ~150-175 tok/sec and hit PP16384 around 3-4K on qwen 3.6 35B Q4
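The context ceilings quoted above come down to KV-cache memory, which grows linearly with context length. A rough sketch of that arithmetic — the layer count, KV-head count, and head dim below are placeholder values, not this model's actual config (read the real ones from the GGUF metadata or the model's config.json):

```python
def kv_cache_gb(ctx: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 (K and V) x layers x KV heads x head dim
    x context length x bytes per element (2 for f16 cache)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

# Hypothetical GQA config: 48 layers, 4 KV heads, head_dim 128
print(f"128k ctx: {kv_cache_gb(131072, 48, 4, 128):.1f} GB")
print(f"256k ctx: {kv_cache_gb(262144, 48, 4, 128):.1f} GB")
```

With numbers in that range, doubling the context doubles the cache, which is why a smaller quant (Q4 vs Q5) buys you the headroom for the full context window.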
jsorres@reddit (OP)
That's insane, I'm getting 106 tok/sec now at 131072 context with Q5 (LM Studio). Thanks for this answer!
gasgarage@reddit
same rig here. lemonade server + claude code plugin + qwen3.6 Q4_K_XL unsloth gguf on vulkan works quite nicely for me.
Basically you run it with 'lemonade', then in another terminal 'lemonade launch claude'; it will ask you which model and off it goes.
jsorres@reddit (OP)
I didn't know this tool, lemonade server - I'll take a look, thx for your contribution ☑️
Opteron67@reddit
which quant ?
jsorres@reddit (OP)
Q4_K_M, 22 GB
oxygen_addiction@reddit
The general rule is: run the largest quant you can with whatever max context you need.
Q4_K_M is usually the best size/quality tradeoff, but getting closer to Q8 will give better output quality overall.
You can read this about 3.5 - https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations
jsorres@reddit (OP)
Thanks for your answer, much appreciated. That's the model and quant I'm using. I'm running a 49K context window, which seems plenty, but... never enough, I think. Going with Q5 would force me down to 32K, right?
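Rough arithmetic on that quant-vs-context tradeoff, assuming ~2 GB of runtime/compute overhead and a hypothetical ~25 GB Q5 file (both assumed figures, not from the thread):

```python
VRAM_GB = 32.0  # Radeon AI PRO R9700

def kv_budget_gb(weights_gb: float, overhead_gb: float = 2.0) -> float:
    """VRAM left over for KV cache after weights and runtime overhead."""
    return VRAM_GB - weights_gb - overhead_gb

# OP's Q4_K_M file is 22 GB; a Q5 of the same model is assumed ~25 GB here.
print(f"Q4_K_M (22 GB): ~{kv_budget_gb(22):.0f} GB left for KV cache")
print(f"Q5 (~25 GB):    ~{kv_budget_gb(25):.0f} GB left for KV cache")
```

Since KV-cache size scales linearly with context length, losing roughly 3 GB of headroom to the bigger quant cuts the maximum context by about the same fraction, which matches the Q4-at-256k vs Q5-at-128k figures quoted earlier in the thread.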