Any good MOE ~60B models? I have 64GB vram
Posted by opoot_@reddit | LocalLLaMA | 32 comments
I have a build with 2 x MI50 32GBs and 64 gigs of DDR4 (bought before the rampocalypse for ~630 USD total, I’m not rich) and I’m not gonna upgrade it for a long while.
Are there any good MOE models that are around 60B in parameters so I can make use of all the VRAM? I feel like I’m stuck in a weird spot where using small models feels like a waste but I can’t really use larger models.
Any suggestions? Thanks
ambient_temp_xeno@reddit
I gave up on trying to get creative writing out of LLMs, but minimax m2.5 is worth trying. I didn't hate the stories it gave me, but whether they're better than gemma 4 31b I don't know.
Kahvana@reddit
Personally I'm getting quite decent results compared to previous local models like magistral or cydonia, and my system prompt for sillytavern isn't even optimal.
I reckon you can get much better results immediately by using koboldcpp with gemma4-31b, text completion (I'm using chat completion), enable dry + xtc and don't use presence/repeat/frequency penalty. top-nsigma might get you better results than top-p and top-k.
My not quite optimal system prompt below, might give some inspiration:
Obviously the writing is based on preference. "Avoid epanorthosis" prevents the "It's not X, It's Y" pattern, "Avoid empty platitudes" removes the cliche "Time heals all wounds" type of dialogues.
I might remove "avoid metaphors" as it might limit Gemma 4's natural speech too much, and forbidding empty platitudes at least gets rid of the annoying ones.
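For readers unfamiliar with the top-nsigma sampler mentioned above, here is a minimal sketch of the idea as it is commonly described: keep only the tokens whose logit lies within n standard deviations of the highest logit, then renormalize. The function names and the example logits are made up for illustration, and koboldcpp's actual implementation may differ in its details.

```python
import math

def top_nsigma_filter(logits, n=1.0):
    """Return indices of tokens whose logit is >= max_logit - n * std(logits)."""
    max_logit = max(logits)
    mean = sum(logits) / len(logits)
    std = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = max_logit - n * std
    return [i for i, x in enumerate(logits) if x >= threshold]

def softmax_over(logits, keep):
    """Softmax restricted to the kept token indices."""
    m = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - m) for i in keep}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Hypothetical logits: two strong candidates and a long tail.
example_logits = [10.0, 9.5, 2.0, 1.0, 0.0]
kept = top_nsigma_filter(example_logits, n=1.0)
probs = softmax_over(example_logits, kept)
```

Unlike top-k, the number of surviving tokens is not fixed: a peaked distribution keeps few candidates and a flat one keeps many.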
ambient_temp_xeno@reddit
The best (lol) stuff I got out of LLMs was when I had early models cranked into almost/actually surreal states and then completely rewrote the output or stole the ideas it accidentally gave me.
mycall@reddit
I do miss GPT-3 word salad mmmm.
devildip@reddit
Gemma 4 31b is a dense model. The 26b variant is the MOE. I get 17 t/s on my 4060 laptop with 64gb of ram at 8-bit and 11 t/s at 16-bit float. With your hardware, you should get much faster speeds.
Truthfully, if we're looking at benchmarks, the only superior and realistically operable open source model is glm 5.
mateszhun@reddit
"Sadly" Qwen 3.6 27B and Gemma 4 31B dominate the open non-large models. You can find better for vision, but not for general or coding tasks. If you want speed, check out Qwen 3.6 35B, that may satisfy you.
Last_Mastod0n@reddit
Just curious, what large model is considered better at vision?
mateszhun@reddit
Nemotron 3 Super, Ovis 2.6 80B
Last_Mastod0n@reddit
Wow, Nemotron, didn't expect that. The Nemotron models have continued to disappoint me compared to Qwen and Gemma. I may have to see if I can generate a few responses just to test it (albeit slowly) with my 64gb ram
HypnoDaddy4You@reddit
I think a lot of the early nemotron builds have problems with support in the tool chain, based on things I've seen like repeating tokens and vowels with diacritics
craftogrammer@reddit
anything for 96GB RAM but 16GB VRAM? 5080?
FatheredPuma81@reddit
You could try Nemotron or Qwen3.5 122B. Pretty sure both are worse than Qwen3.6 35B and they'll be a lot slower though.
mateszhun@reddit
Same recommendation.
AsliReddington@reddit
Qwen, Nemotron3 & Gemma, your task should be picking between them
pmttyji@reddit
Q4/Q5/Q6/Q8 of Qwen3.6-35B-A3B, Qwen3.6-27B, Gemma-4-26B-A4B
MXFP4 of GPT-OSS-120B
Q4/Q5/Q6 of Qwen3-Coder-Next
Q4 of Qwen3.5-122B-A10B, Nemotron-3-Super-120B-A12B, Mistral-Small-4-119B-2603, GLM-4.5-Air
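To sanity-check a list like this against a VRAM budget, a rough weight-size estimate is parameters times bits per weight divided by 8. The sketch below assumes approximate bits-per-weight figures (real GGUF files vary by quant variant) and ignores KV cache and runtime overhead; the helper names are made up for illustration.

```python
import re

# Assumed, approximate bits per weight; actual GGUF quant variants differ.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.5, "Q8": 8.5, "F16": 16.0}

def parse_params(name):
    """Extract (total_B, active_B) from names like 'Qwen3.6-35B-A3B'.

    Dense models have no 'A..B' part, so active equals total.
    """
    total = None
    active = None
    for part in name.split("-"):
        if re.fullmatch(r"\d+(?:\.\d+)?B", part):
            total = float(part[:-1])
        elif re.fullmatch(r"A\d+(?:\.\d+)?B", part):
            active = float(part[1:-1])
    if active is None:
        active = total
    return total, active

def weight_gb(total_b, quant):
    """Approximate weight size in GB (10^9 bytes) for total_b billion parameters."""
    return total_b * BITS_PER_WEIGHT[quant] / 8

for model, quant in [("Qwen3.6-35B-A3B", "Q8"), ("Qwen3.6-27B", "Q8"),
                     ("Qwen3.5-122B-A10B", "Q4")]:
    total_b, active_b = parse_params(model)
    print(f"{model} @ {quant}: ~{weight_gb(total_b, quant):.1f} GB weights, "
          f"{active_b:g}B active per token")
```

Under these assumptions a 122B model at Q4 lands slightly above 64 GB for weights alone, which is why the larger entries tend to need a lower-bit quant or some CPU offload on a 64 GB setup.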
munkiemagik@reddit
With qwen3.6-27b performing as well as it does for its size, what kind of tasks do you find yourself preferring to use GPT-OSS-120B for? I'm interested in your (and u/kevin_1994's) take, as I used to run OSS-120b a lot last year and haven't fired it up in a while, but for some reason I can't bring myself to wipe it off the system.
kevin_1994@reddit
It's really good at math. Even today, even compared to frontier. And the abliterated versions combined with lack of sycophancy make it decent for bouncing ideas off of
trucekill@reddit
I just picked up a couple MI50s in hopes of running Qwen3.6 27B at Q4. Can you get max 256K context out of that setup?
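Whether a long context fits depends mostly on the KV cache, which for a standard transformer with full attention on every layer grows as 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The sketch below uses hypothetical architecture numbers, not the real Qwen3.6 27B configuration, and models with hybrid or sliding-window attention would need less.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size in GiB, assuming full attention on every layer."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30

# Hypothetical model: 64 layers, 8 KV heads, head_dim 128, 256K context.
f16_cache = kv_cache_gib(64, 8, 128, 262144, bytes_per_elem=2)  # F16 cache
q8_cache = kv_cache_gib(64, 8, 128, 262144, bytes_per_elem=1)   # roughly 8-bit cache
print(f"F16 KV: {f16_cache:.0f} GiB, ~8-bit KV: {q8_cache:.0f} GiB")
```

With numbers in that range, an F16 cache at the full 256K would consume an entire 64 GB budget before any weights are loaded, so check the real layer and head counts from the model's config before planning around maximum context.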
havnar-@reddit
Qwen 3.6 barely fits my 64gb Mac Pro, so don’t kid yourself
NotARedditUser3@reddit
Gemma4 31b for creative writing feels a little silly. LFM2-24b-a2b will scream with speed for creative writing, and is still decently intelligent.
Karyo_Ten@reddit
Why silly, or is it a pun with SillyTavern?
FatheredPuma81@reddit
There is straight up nothing unless you want to run Qwen3.5 122B at 3 bit or with experts offloaded to CPU. On MI50s your best bet is Qwen3.6 35B at Q8. Qwen 27B and Gemma 4 31b at Q8 will be glacial even with MTP/Draft/Ngram (whichever combo gives the best results).
JaredsBored@reddit
Shockingly Mi50 with Q8 27B is pretty usable token generation. Q8 and Q4_1 are pretty much speed equivalent too because the Mi50 needs to up-cast q4 weights to process. With MTP I think 27B will be over 30tps generation. I'd check but I literally just replaced mine...
metmelo@reddit
I'm getting 24 t/s on mine (vs 18 before MTP).
JaredsBored@reddit
Are you quantizing KV and using flash attention? The Mi50 does not like Q8 or Q4 KV, really hurts performance.
FatheredPuma81@reddit
Well if the OP reads this there you go. Go for Q8 27B then with F16 KV Cache. Idk about MTP though good luck setting it up because I apparently can't.
kevin_1994@reddit
I'm in a similar boat: 128gb RAM, 48 GB VRAM (4090 and a 3090)
Right now nothing has been able to beat qwen3.6 27b q8. So my RAM is just chilling.
Other options are:
DinoAmino@reddit
Honest question: most all MoEs are reasoning models that are trained for solving complex problems - are they actually good at creative writing? Are they better than non-reasoning models?
Gwolf4@reddit
Isn't creative writing reasoning at its core?
PromptInjection_@reddit
Try
https://huggingface.co/jdopensource/JoyAI-LLM-Flash
49B.
VoiceApprehensive893@reddit
q8 gemma 4/qwen 3.6
I-will-allow-it@reddit
Get both qwen3.6 models at Q8 with MTP. MTP is a really good speed boost. GPT OSS 120B is still pretty good for some things, but not great at tools. Qwen3.5 122B is pretty good as well, but it’s hard to beat the 3.6 MTP models right now. Gemma 4 models are also good, and for some things they beat the 3.6. Gemma is a lot better at making podcasts using open-notebook.